Realign indexes to a changed Python collection

I have a collection of data and a variable containing indexes to some of its elements.
A filtering operation is applied to the data, eliminating a subset of it.
I want to shift the indexes so that they refer to the updated collection (dropping indexes that point to deleted instances).
I'm using the implementation in the function below; I'm also posting the code I used to validate that it works.
Is there a quick and efficient way to do the index realignment with the core libraries, or a better approach in general?
import random

def align_index(wanted_idx, mask):
    """
    Align a set of indexes to a collection after deletions,
    indicated with a mask.
    Arguments:
        wanted_idx: list of desired integer indexes prior to deletion
        mask: binary mask, where 1's indicate elements that survive deletion
    Returns:
        List of integer indexes to (surviving) desired elements, post-deletion
    """
    # rebuild indexes: remove dangling
    new_idx = [idx for idx in wanted_idx if mask[idx]]
    # mark deleted
    not_mask = [int(not m) for m in mask]
    # cumsum deleted regions
    realigned_idx = [k - sum(not_mask[:k + 1]) for k in new_idx]
    return realigned_idx
# data
data = [random.randint(0, 500) for _ in range(1000)]
rng = list(range(len(data)))

for _ in range(1000):
    # random data deletion / request
    wanted_idx = random.sample(rng, random.randint(5, 100))
    del_index = random.sample(rng, random.randint(5, 100))
    # apply deletion
    mask = [int(i not in del_index) for i in range(len(data))]
    filtered_data = [data[i] for (i, m) in enumerate(mask) if m]
    realigned_index = align_index(wanted_idx, mask)
    # verify
    new_idx = [idx for idx in wanted_idx if mask[idx]]
    l1 = [data[k] for k in new_idx]
    l2 = [filtered_data[k] for k in realigned_index]
    assert l1 == l2

If you use numpy, it's quite simple:
import numpy as np

mask = np.array(mask, dtype=bool)
new_idx = np.cumsum(mask, dtype=np.int64) - 1   # new position of each surviving element
new_idx[~mask] = -1                             # deleted positions are marked with -1
You shouldn't need to recompute new_idx unless more elements get deleted.
Then you can get the remapped index for old index i just by looking up new_idx[i]. Or a whole array at once:
wanted_idx = np.array(wanted_idx, dtype=np.int64)
remapped_idx = new_idx[wanted_idx]
Note that deleted indices get assigned value -1. You can filter these out if you want:
remapped_idx = remapped_idx[remapped_idx >= 0]
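Putting the answer's pieces together, a minimal end-to-end sketch (the function name align_index_np is just for illustration) that can be checked against the question's align_index:
import numpy as np

def align_index_np(wanted_idx, mask):
    mask = np.asarray(mask, dtype=bool)
    new_idx = np.cumsum(mask, dtype=np.int64) - 1   # new position of every surviving element
    new_idx[~mask] = -1                             # deleted positions are marked with -1
    remapped = new_idx[np.asarray(wanted_idx, dtype=np.int64)]
    return remapped[remapped >= 0].tolist()

# quick check against the pure-Python version from the question
mask = [1, 0, 1, 1, 0, 1]
wanted_idx = [0, 1, 3, 5]
assert align_index_np(wanted_idx, mask) == align_index(wanted_idx, mask)  # both give [0, 2, 3]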

Related

Aid in Understanding Pre-Processing for Facebook's DLRM Code

Hello, I am looking to understand how Facebook pre-processes its data, and I think I have a basic understanding: Facebook has two functions to pre-process the data.
They convert unicode strings into distinct integers. Both functions are similar in that they create an empty array and look through the data to count how many times each distinct value repeats. The difference between the two functions is that one returns the occurrence counts, and the second returns the indices of the unique distinct integers.
I was hoping someone could give me a better understanding, or point out something I may have missed.
Link to Code: https://github.com/facebookresearch/dlrm/blob/11afc52120c5baaf0bfe418c610bc5cccb9c5777/data_utils.py#L51
First Function
def convertUStringToDistinctIntsDict(mat, convertDicts, counts):
    # Converts matrix of unicode strings into distinct integers.
    #
    # Inputs:
    #     mat (np.array): array of unicode strings to convert
    #     convertDicts (list): dictionary for each column
    #     counts (list): number of different categories in each column
    #
    # Outputs:
    #     out (np.array): array of output integers
    #     convertDicts (list): dictionary for each column
    #     counts (list): number of different categories in each column

    # check if convertDicts and counts match correct length of mat
    if len(convertDicts) != mat.shape[1] or len(counts) != mat.shape[1]:
        print("Length of convertDicts or counts does not match input shape")
        print("Generating convertDicts and counts...")
        convertDicts = [{} for _ in range(mat.shape[1])]
        counts = [0 for _ in range(mat.shape[1])]
    # initialize output
    out = np.zeros(mat.shape)
    for j in range(mat.shape[1]):
        for i in range(mat.shape[0]):
            # add to convertDict and increment count
            if mat[i, j] not in convertDicts[j]:
                convertDicts[j][mat[i, j]] = counts[j]
                counts[j] += 1
            out[i, j] = convertDicts[j][mat[i, j]]
    return out, convertDicts, counts
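For illustration, a small hypothetical call (the string matrix here is made up) showing how each column gets its own string-to-integer dictionary:
import numpy as np

mat = np.array([["apple", "x"], ["banana", "y"], ["apple", "x"]])
out, convertDicts, counts = convertUStringToDistinctIntsDict(mat, [], [])
# out          -> [[0. 0.], [1. 1.], [0. 0.]]
# convertDicts -> [{'apple': 0, 'banana': 1}, {'x': 0, 'y': 1}]
# counts       -> [2, 2]
# (the empty lists passed in trigger the "Generating convertDicts and counts..." branch)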
Second Function
def convertUStringToDistinctIntsUnique(mat, mat_uni, counts):
    # mat is an array of 0,...,# samples, with each being 26 categorical features

    # check if mat_unique and counts match correct length of mat
    if len(mat_uni) != mat.shape[1] or len(counts) != mat.shape[1]:
        print("Length of mat_unique or counts does not match input shape")
        print("Generating mat_unique and counts...")
        mat_uni = [np.array([]) for _ in range(mat.shape[1])]
        counts = [0 for _ in range(mat.shape[1])]
    # initialize output
    out = np.zeros(mat.shape)
    ind_map = [np.array([]) for _ in range(mat.shape[1])]
    # find out and assign unique ids to features
    for j in range(mat.shape[1]):
        m = mat_uni[j].size
        mat_concat = np.concatenate((mat_uni[j], mat[:, j]))
        mat_uni[j], ind_map[j] = np.unique(mat_concat, return_inverse=True)
        out[:, j] = ind_map[j][m:]
        counts[j] = mat_uni[j].size
    return out, mat_uni, counts
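The core of the second function is np.unique(..., return_inverse=True): the inverse array maps every element of the input column to its position in the sorted array of unique values, which is the integer id it is assigned. A small sketch:
import numpy as np

col = np.array(["banana", "apple", "banana", "cherry"])
uniques, inverse = np.unique(col, return_inverse=True)
# uniques -> ['apple' 'banana' 'cherry']
# inverse -> [1 0 1 2]   (the integer id of each original element)
Note that, unlike the dictionary version, these ids follow the sorted order of the values and can therefore shift when new values arrive in a later batch; that is why the function concatenates the previously seen uniques before calling np.unique and keeps only the tail of the inverse map (ind_map[j][m:]).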

Split a nested list in equally sized parts

I would like to separate a list of 80 sets of coordinates into buckets of 8 sets each.
This is what I tried. I found the indexes where the buckets start, then sliced the list of coordinates between one index and the next. Finally, I used an if statement to create the final bucket, since there is no 'next' index for the last one. Any ideas to improve this approach?
Thank you.
nested_lst = [[0.5, 11.3, 5.1]] * 80
indexes = list(range(len(nested_lst)))[::8]
buckets = []
for i in range(len(indexes)):
    if i != len(indexes) - 1:
        bucket = nested_lst[indexes[i]:indexes[i + 1]]
    else:
        bucket = nested_lst[indexes[i]:]
    buckets.append(bucket)
The coordinates list can also be split with a plain loop (assuming its length is a multiple of 8):
coordinates = list(nested_lst)
buckets = []
i = 0
while i < len(coordinates):
    l = []
    for _ in range(8):
        l.append(coordinates[i])
        i += 1
    buckets.append(l)
Could this work for you? Reference answer here
def batch(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

nested_lst = [[0.5, 11.3, 5.1]] * 80
buckets = list(batch(nested_lst, n=8))
print(buckets)
The results match yours, but this might be a more efficient and better-looking way to do it.
Use numpy! It'll be much faster and simpler
import numpy as np

coordinates = np.array([0.5, 11.3, 5.1] * 80)
# Number of buckets
chunks = 8
# Equally sized buckets are a must, and reshape will also fail if this is not the case
assert coordinates.size % chunks == 0
# Size of each bucket
chunksize = coordinates.size // chunks
# There you go, 8 buckets of the same size
res = coordinates.reshape((chunks, chunksize))
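If the goal is to keep the coordinate triples intact and get buckets of 8 sets each, as in the question, a reshape on the nested list works as well; a sketch, assuming 80 triples:
import numpy as np

nested = np.array([[0.5, 11.3, 5.1]] * 80)   # shape (80, 3)
buckets = nested.reshape(-1, 8, 3)           # shape (10, 8, 3): 10 buckets of 8 coordinate sets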

How to find steps in a vector (1d array, list) in Python?

I want to get the borders of the data in a list using Python.
For example, I have this list:
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
I want code that returns the data borders. For example:
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
     ^       ^     ^         ^
b = get_border_index(a)
print(b)
output:
[0,4,7,12]
How can I implement the get_border_index(lst: list) -> list function?
The scalable answer that also works for very long lists or arrays is to use np.diff. In that case you should avoid a for loop at all costs.
import numpy as np
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
a = np.array(a)
# this is unequal 0 if there is a step
d = np.diff(a)
# boolean array where the steps are
is_step = d != 0
# get the indices of the steps (first one is trivial).
ics = np.where(is_step)
# get the first dimension and shift by one as you want
# the index of the element right of the step
ics_shift = ics[0] + 1
# and if you need a list
ics_list = ics_shift.tolist()
print(ics_list)
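Note that this prints [4, 7, 12]; the trivial first border at index 0 from the question's expected output has to be added back by hand, e.g.:
ics_list = [0] + ics_list
print(ics_list)  # [0, 4, 7, 12]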
You can use a for loop with enumerate:
def get_border_index(a):
    last_value = None
    result = []
    for i, v in enumerate(a):
        if v != last_value:
            last_value = v
            result.append(i)
    return result

a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
b = get_border_index(a)
print(b)
Output
[0, 4, 7, 12]
This code checks whether each element in the a list is different from the element before it and, if so, appends the element's index to the result list.

Creating 'n' combinations randomly from multiple lists

import numpy as np
import itertools

def models():
    default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
    lower = [np.log10(i / 10) for i in default]
    upper = [np.log10(i * 10) for i in default]
    n = 5
    a = np.logspace(lower[0], upper[0], n)
    b = np.logspace(lower[1], upper[1], n)
    c = np.logspace(lower[2], upper[2], n)
    d = np.logspace(lower[3], upper[3], n)
    e = np.logspace(lower[4], upper[4], n)
    f = np.logspace(lower[5], upper[5], n)
    g = np.logspace(lower[6], upper[6], n)
    combs = itertools.product(a, b, c, d, e, f, g)
    list1 = []
    for x in combs:
        list1.append(list(x))
    return list1
The code above returns a list of 5^7 = 78,125 lists. Is there a way I can combine items in a, b, c, d, e, f, g, possibly randomly, to create a list of, say, 10,000 lists?
You could take random samples of each array and combine them, especially if you don't need to guarantee that specific combinations don't occur more than once:
import numpy as np
import random

def random_models(num_values):
    n = 5
    default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
    ranges = zip((np.log10(i / 10) for i in default),
                 (np.log10(i * 10) for i in default))
    data_arrays = []
    for lower, upper in ranges:
        data_arrays.append(np.logspace(lower, upper, n))
    results = []
    for i in range(num_values):
        results.append([random.choice(arr) for arr in data_arrays])
    return results

l = random_models(10000)
print(len(l))
Here's a version that will avoid repeats up until you request more data than can be given without repeating:
import itertools

def random_models_avoid_repeats(num_values):
    n = 5
    default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
    # Build the range data (tuples of (lower, upper) range)
    ranges = zip((np.log10(i / 10) for i in default),
                 (np.log10(i * 10) for i in default))
    # Create the data arrays to sample from
    data_arrays = []
    for lower, upper in ranges:
        data_arrays.append(np.logspace(lower, upper, n))
    sequence_data = []
    for entry in itertools.product(*data_arrays):
        sequence_data.append(entry)
    results = []
    # Holds the current choices to choose from. The data will come from
    # sequence_data above, but randomly shuffled. Values are popped off the
    # end to keep things efficient. It's possible to ask for more data than
    # the samples can give without repeats. In that case, we'll reload
    # temp_data, randomly shuffle again, and start the process over until we've
    # delivered the number of desired results.
    temp_data = []
    # Build the lists
    for i in range(num_values):
        if len(temp_data) == 0:
            temp_data = sequence_data[:]
            random.shuffle(temp_data)
        results.append(temp_data.pop())
    return results
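Usage mirrors the first version; for example (the name combos is just for illustration):
combos = random_models_avoid_repeats(10000)
print(len(combos))  # 10000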
Also note that you can avoid building a results list by turning this into a generator with yield. However, you'd then want to consume the results with a for statement as well:
def random_models_avoid_repeats_generator(num_values):
    n = 5
    default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
    # Build the range data (tuples of (lower, upper) range)
    ranges = zip((np.log10(i / 10) for i in default),
                 (np.log10(i * 10) for i in default))
    # Create the data arrays to sample from
    data_arrays = []
    for lower, upper in ranges:
        data_arrays.append(np.logspace(lower, upper, n))
    sequence_data = []
    for entry in itertools.product(*data_arrays):
        sequence_data.append(entry)
    # Holds the current choices to choose from. The data will come from
    # sequence_data above, but randomly shuffled. Values are popped off the
    # end to keep things efficient. It's possible to ask for more data than
    # the samples can give without repeats. In that case, we'll reload
    # temp_data, randomly shuffle again, and start the process over until we've
    # delivered the number of desired results.
    temp_data = []
    # Build the lists
    for i in range(num_values):
        if len(temp_data) == 0:
            temp_data = sequence_data[:]
            random.shuffle(temp_data)
        yield temp_data.pop()
You'd have to use it like this:
for entry in random_models_avoid_repeats_generator(10000):
    pass  # Do stuff with each entry...
Or manually iterate over it using next().
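For example, a couple of manual pulls from the generator (the name gen is just illustrative):
gen = random_models_avoid_repeats_generator(10000)
first = next(gen)   # one combination of the seven parameters
second = next(gen)  # another one; no repeats until the full pool is exhausted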

Group by max or min in a numpy array

I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:
id  data
 1    2
 1    7
 1    3
 2    8
 2    9
 2   10
 3    1
 3  -10
I would like to aggregate data by grouping on id and taking either the max or the min.
In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.
Is there a way I can avoid Python loops and do this in a vectorized manner?
I've been seeing some very similar questions on Stack Overflow the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a Python loop.
import numpy as np
def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

# max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
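For instance, on arrays built from the question's example (this assumes the inputs are numpy arrays, as the functions require):
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])
print(group_max(groups, data))  # [ 7 10  1]
print(group_min(groups, data))  # [  2   8 -10]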
In pure Python:
from itertools import groupby
from operator import itemgetter as ig

print([max(map(ig(1), g)) for k, g in groupby(zip(id, data), key=ig(0))])
# -> [7, 10, 1]
A variation:
print([data[id == i].max() for i, _ in groupby(id)])
# -> [7, 10, 1]
Based on #Bago's answer:
import numpy as np
# sort by `id` then by `data`
ndx = np.lexsort((data, id))
id, data = id[ndx], data[ndx]
# get max()
print(data[np.r_[np.diff(id), True].astype(bool)])
# -> [ 7 10  1]
If pandas is installed:
from pandas import DataFrame

df = DataFrame(dict(id=id, data=data))
print(df.groupby('id')['data'].max())
# id
# 1     7
# 2    10
# 3     1
I'm fairly new to Python and NumPy, but it seems like you can use the .at method of ufuncs rather than reduceat:
import numpy as np
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))
ans = np.empty(data_id[-1]+1)  # might want to use max(data_id)+1 and np.full(..., -np.inf) instead, since np.empty leaves uninitialized values
np.maximum.at(ans,data_id,data_val)
For example:
data_val = array([ 0.65753453, 0.84279716, 0.88189818, 0.18987882, 0.49800668,
0.29656994, 0.39542769, 0.43155428, 0.77982853, 0.44955868,
0.22080219, 0.4807312 , 0.9288989 , 0.10956681, 0.73215416,
0.33184318, 0.10936647])
ans = array([ 0.98969952, 0.84044947, 0.63460516, 0.92042078, 0.75738113,
0.37976055])
Of course this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge; presumably, if they are large/sparse, you could initialize ans using np.unique(data_id) or something).
I should point out that the data_id doesn't actually need to be sorted.
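A minimal variant of the same idea (my sketch, not part of the original answer) that sidesteps the uninitialized-memory caveat by starting from -inf:
ans = np.full(data_id.max() + 1, -np.inf)
np.maximum.at(ans, data_id, data_val)
# groups that never occur in data_id stay at -inf instead of holding garbage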
With only numpy and without loops:
import numpy as np
import pandas as pd

id = np.asarray([1, 1, 1, 2, 2, 2, 3, 3])
data = np.asarray([2, 7, 3, 8, 9, 10, 1, -10])

# max
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)

# min
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)

# compare results with pandas groupby
np_group = pd.DataFrame({'min': g_min, 'max': g_max}, index=_id)
pd_group = pd.DataFrame({'id': id, 'data': data}).groupby('id').agg(['min', 'max'])
(pd_group.values == np_group.values).all()  # True
I've packaged a version of my previous answer in the numpy_indexed package; it's nice to have this all wrapped up and tested in a neat interface, and it has a lot more functionality as well:
import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)
And so on
A slightly faster and more general answer than the already accepted one; like the answer by joeln it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only demands that the keys are sortable, rather than being ints in a specific range. The accepted answer may still be faster though, considering the max/min isn't explicitly computed. The accepted solution's ability to ignore NaNs is neat, but one may also simply assign NaN values a dummy key.
import numpy as np

def group(key, value, operator=np.add):
    """
    Group the values by key.
    Any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.subtract, and so on).
    Returns the unique keys, their corresponding per-key reduction over the operator, and the key counts.
    """
    # upcast to numpy arrays
    key = np.asarray(key)
    value = np.asarray(value)
    # first, sort by key
    I = np.argsort(key)
    key = key[I]
    value = value[I]
    # the slicing points of the bins to sum over
    slices = np.concatenate(([0], np.where(key[:-1] != key[1:])[0] + 1))
    # first entry of each bin is a unique key
    unique_keys = key[slices]
    # reduce over the slices specified by index
    per_key_sum = operator.reduceat(value, slices)
    # number of counts per key is the difference of our slice points; cap off with number of keys for last bin
    key_count = np.diff(np.append(slices, len(key)))
    return unique_keys, per_key_sum, key_count

names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

unique_keys, reduced_values, key_count = group(names, values)
print('per group mean')
print(reduced_values / key_count)
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print('per group min')
print(reduced_values)
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print('per group max')
print(reduced_values)
I think this accomplishes what you're looking for:
[max([val for idx, val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]
For the outer list comprehension, from right to left: set(id) collects the ids, sorted() sorts them, for k ... iterates over them, and max takes the max of, in this case, another list comprehension. Moving to that inner list comprehension: enumerate(data) returns both the index and value from data, and if id[idx] == k picks out the data members corresponding to id k.
This iterates over the full data list for each id. With some preprocessing into sublists (see the sketch below) it might be possible to speed it up, but it won't be a one-liner then.
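A rough sketch of that preprocessing idea (my own illustration, not from the answer): bucket the values by id in a single pass, then take the max of each bucket:
from collections import defaultdict

sublists = defaultdict(list)
for k, val in zip(id, data):
    sublists[k].append(val)
print([max(sublists[k]) for k in sorted(sublists)])
# -> [7, 10, 1] for the example id/data from the question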
The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r, then r[o] = x fills r with the latest value x for each value of o, so that r[[0, 0]] = [1, 2] ends with r[0] == 2. It requires that your groups are integers from 0 to the number of groups - 1, as for numpy.bincount, and that there is a value for every group:
def group_min(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)[::-1]
    result[groups.take(order)] = data.take(order)
    return result

def group_max(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)
    result[groups.take(order)] = data.take(order)
    return result
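For example, with the question's ids remapped to the required 0-based group labels (a restriction of this approach, noted above):
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2])   # ids 1, 2, 3 remapped to 0, 1, 2
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])
print(group_max(groups, data))  # [ 7. 10.  1.]
print(group_min(groups, data))  # [  2.   8. -10.]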
