Aid in Understanding Pre-Processing for Facebooks DLRM Code

Aid in Understanding Pre-Processing for Facebooks DLRM Code - python

Hello I am looking to understand how facebook pre-processes their data, and I think I have a basic understanding it seems Facebook has two functions to pre-process data
They convert Ustrings into distinct integers. Both functions are similar in that they create an empty array and look through the data to find how many times a distinct integer repeats.The difference between the two functions is that one will return the occurences, and the second will return the indices for unique distinct integers.
I was hoping maybe someone can give me a better understand or see if there was something I may have missed
Link to Code: https://github.com/facebookresearch/dlrm/blob/11afc52120c5baaf0bfe418c610bc5cccb9c5777/data_utils.py#L51
First Function
def convertUStringToDistinctIntsDict(mat, convertDicts, counts):
# Converts matrix of unicode strings into distinct integers.
#
# Inputs:
# mat (np.array): array of unicode strings to convert
# convertDicts (list): dictionary for each column
# counts (list): number of different categories in each column
#
# Outputs:
# out (np.array): array of output integers
# convertDicts (list): dictionary for each column
# counts (list): number of different categories in each column
# check if convertDicts and counts match correct length of mat
if len(convertDicts) != mat.shape[1] or len(counts) != mat.shape[1]:
print("Length of convertDicts or counts does not match input shape")
print("Generating convertDicts and counts...")
convertDicts = [{} for _ in range(mat.shape[1])]
counts = [0 for _ in range(mat.shape[1])]
# initialize output
out = np.zeros(mat.shape)
for j in range(mat.shape[1]):
for i in range(mat.shape[0]):
# add to convertDict and increment count
if mat[i, j] not in convertDicts[j]:
convertDicts[j][mat[i, j]] = counts[j]
counts[j] += 1
out[i, j] = convertDicts[j][mat[i, j]]
return out, convertDicts, counts
Second Function
def convertUStringToDistinctIntsUnique(mat, mat_uni, counts):
# mat is an array of 0,...,# samples, with each being 26 categorical features
# check if mat_unique and counts match correct length of mat
if len(mat_uni) != mat.shape[1] or len(counts) != mat.shape[1]:
print("Length of mat_unique or counts does not match input shape")
print("Generating mat_unique and counts...")
mat_uni = [np.array([]) for _ in range(mat.shape[1])]
counts = [0 for _ in range(mat.shape[1])]
# initialize output
out = np.zeros(mat.shape)
ind_map = [np.array([]) for _ in range(mat.shape[1])]
# find out and assign unique ids to features
for j in range(mat.shape[1]):
m = mat_uni[j].size
mat_concat = np.concatenate((mat_uni[j], mat[:, j]))
mat_uni[j], ind_map[j] = np.unique(mat_concat, return_inverse=True)
out[:, j] = ind_map[j][m:]
counts[j] = mat_uni[j].size
return out, mat_uni, counts

Related

Torch filter multidimensional tensor by start and end values

I have a list of sentences and I am looking to extract contents between two items.
If the start or end item does not exist, I want it to return a row with padding only.
I already have the sentences tokenized and padded with 0 to a fixed length.
I figured a way to do this using for loops, but it is extremely slow, so would like to
know what is the best way to solve this, probably by using tensor operations.
import torch
start_value, end_value = 4,9
data = torch.tensor([
[3,4,7,8,9,2,0,0,0,0],
[1,5,3,4,7,2,8,9,10,0],
[3,4,7,8,10,0,0,0,0,0], # does not contain end value
[3,7,5,9,2,0,0,0,0,0], # does not contain start value
])
# expected output
[
[7,8,0,0,0,0,0,0,0,0],
[7,2,8,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
]
# or
[
[0,0,7,8,0,0,0,0,0,0],
[0,0,0,0,7,2,8,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
]
The current solution that I have, which uses a for loop. It does not produce a symmetric array like I want in the expected output.
def _get_part_from_tokens(
self,
data: torch.Tensor,
s_id: int,
e_id: int,
) -> list[str]:
input_ids = []
for row in data:
try:
s_index = (row == s_id).nonzero(as_tuple=True)[0][0]
e_index = (row == e_id).nonzero(as_tuple=True)[0][0]
except IndexError:
input_ids.append(torch.tensor([]))
continue
if s_index is None or e_index is None or s_index > e_index:
input_ids.append(torch.tensor([]))
continue
ind = torch.arange(s_index + 1, e_index)
input_ids.append(row.index_select(0, ind))
return input_ids

A possible loop-free approach is this:
import torch
# using the provided sample data
start_value, end_value = 4,9
data = torch.tensor([
[3,4,7,8,9,2,0,0,0,0],
[1,5,3,4,7,2,8,9,10,0],
[3,4,7,8,10,0,0,0,0,0], # does not contain end value
[3,7,5,9,2,0,0,0,0,0], # does not contain start value
[3,7,5,8,2,0,0,0,0,0], # does not contain start or end value
])
First, check which rows contain only a start_value or an end_value and fill these rows with 0.
# fill 'invalid' rows with 0
starts = (data == start_value)
ends = (data == end_value)
invalid = ((starts.sum(axis=1) - ends.sum(axis=1)) != 0)
data[invalid] = 0
Then set the values up to (and including) the start_value and after (and including) the end_value to 0 in each row. This step targets mainly the 'valid' rows. Nevertheless, all other rows will (again) be overwritten with zeros.
# set values in the start and end of 'valid rows' to 0
row_length = data.shape[1]
start_idx = starts.long().argmax(axis=1)
start_mask = (start_idx[:,None] - torch.arange(row_length))>=0
data[start_mask] = 0
end_idx = row_length - ends.long().argmax(axis=1)
end_mask = (end_idx[:,None] + torch.arange(row_length))>=row_length
data[end_mask] = 0
Note: This works also, if a row contains neither a start_value nor an end_value (I added such a row to the sample data). Still, there are many more edge cases that one could think of (e.g. multiple start and end values in one row, start value after end value, ...). Not sure if they are of relevance for the specific problem.
Comparison of execution time
Using timeit and randomly generated data to compare the execution time of the different approaches suggests, that the approach without loops is considerably faster than the approach from the question. If the data is converted to numpy first and converted back to Pytorch afterwards some further (very minor) time savings are possible.
Each dot (execution time) in the plot is the minimum value of 3 trials each with 100 repetitions.

this is my attempt at improving #rosa b. algorithm.
Could you try this:
def function1(
data: torch.Tensor,
start_value: int,
end_value: int,
):
# fill 'invalid' rows with 0
row_length = data.shape[1]
starts = (data == start_value)
ends = (data == end_value)
invalid = ((starts.sum(axis=1) - ends.sum(axis=1)) != 0)
data[invalid] = 0
valid_ind = torch.where(torch.logical_not(invalid))
# set values in the start and end of 'valid rows' to 0
arange_arr = torch.arange(row_length)
start_idx = starts.long()[valid_ind].argmax(axis=1)
start_mask = (start_idx[:, None] - arange_arr) >= 0
end_idx = row_length - ends.long()[valid_ind].argmax(axis=1)
end_mask = (end_idx[:, None] + arange_arr) >= row_length
mask = torch.logical_or(start_mask, end_mask)
tmp = data[valid_ind]
tmp.masked_fill_(mask, 0)
data[valid_ind] = tmp
return data
The main idea is I think the list of valid indexes is small. Therefore, we could skip many computations. I make some other minor updates so it should be slightly faster.
(Sorry I don't have enough reputation to make a comment).

Realign indexes to a changed python collection

I have a collection of data and a variable containing indexes to some of them.
A filtering operation is applied on the data that eliminates a subset of the data.
I want to shift the indexes so that they refer to the updated collection of data (eliminating indexes to deleted instances).
I'm using the implementation in the function below. I'm also posting the code I used to validate that it works.
Is there a quick & fast way to do the index realignment via the core libraries or a better way in general?
import random
def align_index(wanted_idx, mask):
"""
Function to align a set of indexes to a collection after deletions,
indicated with a mask
Arguments:
wanted_idx: List of desired integer indexes prior to deletion
mask: Binary mask, where 1's indicate elements that survive deletion
Returns:
List of integer indexes to (surviving) desired elements, post-deletion
"""
# rebuild indexes: remove dangling
new_idx = [idx for (i, idx) in enumerate(wanted_idx) if mask[idx]]
# mark deleted
not_mask = [int(not m) for m in mask]
# cumsum deleted regions
realigned_idx = [k-sum(not_mask[:k+1]) for k in new_idx]
return realigned_idx
# data
data = [random.randint(0,500) for _ in range(1000)]
rng = list(range(len(data)))
for _ in range(1000):
# random data deletion / request
wanted_idx = random.sample(rng, random.randint(5,100))
del_index = random.sample(rng, random.randint(5, 100))
# apply deletion
mask = [int(i not in del_index) for i in range(len(data))]
filtered_data = [data[i] for (i, m) in enumerate(mask) if m]
realigned_index = align_index(wanted_idx, mask)
# verify
new_idx = [idx for (i, idx) in enumerate(wanted_idx) if mask[idx]]
l1 = [data[k] for k in new_idx]
l2 = [filtered_data[k] for k in realigned_index]
assert l1 == l2

If you use numpy it's quite trivial:
import numpy as np
mask = np.array(mask, dtype=np.bool)
new_idx = np.cumsum(mask, dtype=np.int64)
new_idx[mask] = -1
You shouldn't need to recompute new_idx unless more elements get deleted.
Then you can get the remapped index for old index i just by looking new_idx[i]. Or a whole array at once:
wanted_idx = np.array(wanted_idx, dtype=np.int64)
remapped_idx = new_idx[wanted_idx]
Note that deleted indices get assigned value -1. You can filter these out if you want:
remapped_idx = remapped_idx[remapped_idx >= 0]

Longest of k consecutive strings

I'm defining a Python function to determine the longest string if the original strings are combined for every k consecutive strings. The function takes two parameters, strarr and k.
Here is an example:
max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2) --> "abigailtheta"
Here's my code so far (instinct is that I'm not passing k correctly within the function)
def max_consec(strarr, k):
lenList = []
for value in zip(strarr, strarr[k:]):
consecStrings = ''.join(value)
lenList.append(consecStrings)
for word in lenList:
if max(word):
return word
Here is a test case not passing:
testing(longest_consec(["zone", "abigail", "theta", "form", "libe", "zas"], 2), "abigailtheta")
My output:
'zonetheta' should equal 'abigailtheta'

It's not quite clear to me what you mean by "every k consecutive strings", but if you mean taking k-length slices of the list and concatenating all the strings in each slice, for example
['a', 'bb', 'ccc', 'dddd'] # k = 3
becomes
['a', 'bb', 'ccc']
['bb', 'ccc', 'dddd']
then
'abbccc'
'bbcccddd'
then this works ...
# for every item in strarr, take the k-length slice from that point and join the resulting strings
strs = [''.join(strarr[i:i + k]) for i in range(len(strarr) - k + 1)]
# find the largest by `len`gth
max(strs, key=len)
this post gives alternatives, though some of them are hard to read/verbose

Store the string lengths in an array. Now assume a window of size k passing through this list. Keep track of the sum in this window and starting point of the window.
When window reaches the end of the array you should have maximum sum and index where the maximum occurs. Construct the result with the elements from this window.
Time complexity: O(size of array + sum of all strings sizes) ~ O(n)
Also add some corner case handling when k > array_size or k <= 0
def max_consec(strarr, k):
size = len(strarr)
# corner cases
if k > size or k <= 0:
return "None" # or None
# store lengths
lenList = [len(word) for word in strarr]
print(lenList)
max_sum = sum(lenList[:k]) # window at index 0
prev_sum = max_sum
max_index = 0
for i in range(1, size - k + 1):
length = prev_sum - lenList[i-1] + lenList[i + k - 1] # window sum at i'th index. Subract previous sum and add next sum
prev_sum = length
if length > max_sum:
max_sum = length
max_index = i
return "".join(strarr[max_index:max_index+k]) # join strings in the max sum window
word = max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2)
print(word)

def max_consec(strarr, k):
n = -1
result = ""
for i in range(len(strarr)):
s = ''.join(strarr[i:i+k])
if len(s) > n:
n = len(s)
result = s
return result
Iterate over the list of strings and create a new string concatenating it with next k strings
Check if the newly created string is the longest. If so memorize it
Repeat the above steps until the iterations complete
return the memorised string

If I understand your question correctly. You have to eliminate duplicate values (in this case with set), sort them by length and concatenate the k longest words.
>>> def max_consec(words, k):
... words = sorted(set(words), key=len, reverse=True)
... return ''.join(words[:k])
...
>>> max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2)
'abigailtheta'
Update:
If the k elements should be consecutive. You can create pairs of consecutive words (in this case with zip). And return the longest if they get joined.
>>> def max_consec(words, k):
... return max((''.join(pair) for pair in zip(*[words[i:] for i in range(k)])), key=len)
...
>>> max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2)
'abigailtheta'

Creating 'n' combinations randomly from multiple lists

def models():
default = [0.6,0.67,2.4e-2,1e-2,2e-5,1.2e-3,2e-5]
lower = [np.log10(i/10) for i in default]
upper = [np.log10(i*10) for i in default]
n = 5
a = np.logspace(lower[0],upper[0],n)
b = np.logspace(lower[1],upper[1],n)
c = np.logspace(lower[2],upper[2],n)
d = np.logspace(lower[3],upper[3],n)
e = np.logspace(lower[4],upper[4],n)
f = np.logspace(lower[5],upper[5],n)
g = np.logspace(lower[6],upper[6],n)
combs = itertools.product(a,b,c,d,e,f,g)
list1 = []
for x in combs:
x = list(x)
list1.append(x)
return list1
The code above returns a list of 5^7 = 78,125 lists. Is there a way I can combine items in a,b,c,d,e,f,g, possibly randomly, to create a list of say, 10000, lists?

You could take random samples of each array and combine them, especially if you don't need to guarantee that specific combinations don't occur more than once:
import numpy as np
import random
def random_models(num_values):
n = 5
default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
ranges = zip((np.log10(i/10) for i in default),
(np.log10(i*10) for i in default))
data_arrays = []
for lower, upper in ranges:
data_arrays.append(np.logspace(lower, upper, n))
results = []
for i in xrange(num_values):
results.append([random.choice(arr) for arr in data_arrays])
return results
l = random_models(10000)
print len(l)
Here's a version that will avoid repeats up until you request more data than can be given without repeating:
def random_models_avoid_repeats(num_values):
n = 5
default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
# Build the range data (tuples of (lower, upper) range)
ranges = zip((np.log10(i/10) for i in default),
(np.log10(i*10) for i in default))
# Create the data arrays to sample from
data_arrays = []
for lower, upper in ranges:
data_arrays.append(np.logspace(lower, upper, n))
sequence_data = []
for entry in itertools.product(*data_arrays):
sequence_data.append(entry)
results = []
# Holds the current choices to choose from. The data will come from
# sequence_data above, but randomly shuffled. Values are popped off the
# end to keep things efficient. It's possible to ask for more data than
# the samples can give without repeats. In that case, we'll reload
# temp_data, randomly shuffle again, and start the process over until we've
# delivered the number of desired results.
temp_data = []
# Build the lists
for i in xrange(num_values):
if len(temp_data) == 0:
temp_data = sequence_data[:]
random.shuffle(temp_data)
results.append(temp_data.pop())
return results
Also note that we can avoid building a results list if you make this a generator by using yield. However, you'd want to consume the results using a forstatement as well:
def random_models_avoid_repeats_generator(num_values):
n = 5
default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
# Build the range data (tuples of (lower, upper) range)
ranges = zip((np.log10(i/10) for i in default),
(np.log10(i*10) for i in default))
# Create the data arrays to sample from
data_arrays = []
for lower, upper in ranges:
data_arrays.append(np.logspace(lower, upper, n))
sequence_data = []
for entry in itertools.product(*data_arrays):
sequence_data.append(entry)
# Holds the current choices to choose from. The data will come from
# sequence_data above, but randomly shuffled. Values are popped off the
# end to keep things efficient. It's possible to ask for more data than
# the samples can give without repeats. In that case, we'll reload
# temp_data, randomly shuffle again, and start the process over until we've
# delivered the number of desired results.
temp_data = []
# Build the lists
for i in xrange(num_values):
if len(temp_data) == 0:
temp_data = sequence_data[:]
random.shuffle(temp_data)
yield temp_data.pop()
You'd have to use it like this:
for entry in random_models_avoid_repeats_generator(10000):
# Do stuff...
Or manually iterate over it using next().

Does NumPy have a function equivalent to Matlab's buffer?

I see there is an array_split and split methods but these are not very handy when you have to split an array of length which is not integer multiple of the chunk size. Moreover, these method’s input is the number of slices rather than the slice size. I need something more like Matlab's buffer method which is more suitable for signal processing.
For example, if I want to buffer a signals to chunks of size 60 I need to do: np.vstack(np.hsplit(x.iloc[0:((len(x)//60)*60)], len(x)//60)) which is cumbersome.

I wrote the following routine to handle the use cases I needed, but I have not implemented/tested for "underlap".
Please feel free to make suggestions for improvement.
def buffer(X, n, p=0, opt=None):
'''Mimic MATLAB routine to generate buffer array
MATLAB docs here: https://se.mathworks.com/help/signal/ref/buffer.html
Parameters
----------
x: ndarray
Signal array
n: int
Number of data segments
p: int
Number of values to overlap
opt: str
Initial condition options. default sets the first `p` values to zero,
while 'nodelay' begins filling the buffer immediately.
Returns
-------
result : (n,n) ndarray
Buffer array created from X
'''
import numpy as np
if opt not in [None, 'nodelay']:
raise ValueError('{} not implemented'.format(opt))
i = 0
first_iter = True
while i < len(X):
if first_iter:
if opt == 'nodelay':
# No zeros at array start
result = X[:n]
i = n
else:
# Start with `p` zeros
result = np.hstack([np.zeros(p), X[:n-p]])
i = n-p
# Make 2D array and pivot
result = np.expand_dims(result, axis=0).T
first_iter = False
continue
# Create next column, add `p` results from last col if given
col = X[i:i+(n-p)]
if p != 0:
col = np.hstack([result[:,-1][-p:], col])
i += n-p
# Append zeros if last row and not length `n`
if len(col) < n:
col = np.hstack([col, np.zeros(n-len(col))])
# Combine result with next row
result = np.hstack([result, np.expand_dims(col, axis=0).T])
return result

def buffer(X = np.array([]), n = 1, p = 0):
#buffers data vector X into length n column vectors with overlap p
#excess data at the end of X is discarded
n = int(n) #length of each data vector
p = int(p) #overlap of data vectors, 0 <= p < n-1
L = len(X) #length of data to be buffered
m = int(np.floor((L-n)/(n-p)) + 1) #number of sample vectors (no padding)
data = np.zeros([n,m]) #initialize data matrix
for startIndex,column in zip(range(0,L-n,n-p),range(0,m)):
data[:,column] = X[startIndex:startIndex + n] #fill in by column
return data

This Keras function may be considered as a Python equivalent of MATLAB Buffer().
See the Sample Code :
import numpy as np
S = np.arange(1,99) #A Demo Array
See Output Here
import tensorflow.keras.preprocessing as kp
list(kp.timeseries_dataset_from_array(S, targets = None,sequence_length=7,sequence_stride=7,batch_size=5))
See the Buffered Array Output Here
Reference : See This

Same as the other answer, but faster.
def buffer(X, n, p=0):
'''
Parameters
----------
x: ndarray
Signal array
n: int
Number of data segments
p: int
Number of values to overlap
Returns
-------
result : (n,m) ndarray
Buffer array created from X
'''
import numpy as np
d = n - p
m = len(X)//d
if m * d != len(X):
m = m + 1
Xn = np.zeros(d*m)
Xn[:len(X)] = X
Xn = np.reshape(Xn,(m,d))
Xne = np.concatenate((Xn,np.zeros((1,d))))
Xn = np.concatenate((Xn,Xne[1:,0:p]), axis = 1)
return np.transpose(Xn[:-1])

ryanjdillon's answer rewritten for significant performance improvement; it appends to a list instead of concatenating arrays, latter which copies the array iteratively and is much slower.
def buffer(x, n, p=0, opt=None):
if opt not in ('nodelay', None):
raise ValueError('{} not implemented'.format(opt))
i = 0
if opt == 'nodelay':
# No zeros at array start
result = x[:n]
i = n
else:
# Start with `p` zeros
result = np.hstack([np.zeros(p), x[:n-p]])
i = n-p
# Make 2D array, cast to list for .append()
result = list(np.expand_dims(result, axis=0))
while i < len(x):
# Create next column, add `p` results from last col if given
col = x[i:i+(n-p)]
if p != 0:
col = np.hstack([result[-1][-p:], col])
# Append zeros if last row and not length `n`
if len(col):
col = np.hstack([col, np.zeros(n - len(col))])
# Combine result with next row
result.append(np.array(col))
i += (n - p)
return np.vstack(result).T

def buffer(X, n, p=0):
'''
Parameters:
x: ndarray, Signal array, input a long vector as raw speech wav
n: int, frame length
p: int, Number of values to overlap
-----------
Returns:
result : (n,m) ndarray, Buffer array created from X
'''
import numpy as np
d = n - p
#print(d)
m = len(X)//d
c = n//d
#print(c)
if m * d != len(X):
m = m + 1
#print(m)
Xn = np.zeros(d*m)
Xn[:len(X)] = X
Xn = np.reshape(Xn,(m,d))
Xn_out = Xn
for i in range(c-1):
Xne = np.concatenate((Xn,np.zeros((i+1,d))))
Xn_out = np.concatenate((Xn_out, Xne[i+1:,:]),axis=1)
#print(Xn_out.shape)
if n-d*c>0:
Xne = np.concatenate((Xn, np.zeros((c,d))))
Xn_out = np.concatenate((Xn_out,Xne[c:,:n-p*c]),axis=1)
return np.transpose(Xn_out)
here is a improved code of Ali Khodabakhsh's sample code which is not work in my cases. Feel free to comment and use it.

Comparing the execution time of the proposed answers, by running
x = np.arange(1,200000)
start = timer()
y = buffer(x,60,20)
end = timer()
print(end-start)
the results are:
Andrzej May, 0.005595300000095449
OverLordGoldDragon, 0.06954789999986133
ryanjdillon, 2.427092700000003

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Aid in Understanding Pre-Processing for Facebooks DLRM Code - python

Related

Torch filter multidimensional tensor by start and end values

Realign indexes to a changed python collection

Longest of k consecutive strings

Creating 'n' combinations randomly from multiple lists

Does NumPy have a function equivalent to Matlab's buffer?

Categories

Resources