QUESTION: How can I use ONNX operators to do string replacements with regular expressions?
I am trying to export a Scikit-Learn machine learning pipeline to the Open Neural Network Exchange (ONNX) format. The pipeline takes text as input. Many of its steps are covered nicely by the standard, such as a TfidfVectorizer and a TruncatedSVD transformer. However, the first step is a custom transformer that modifies the input text using regular expressions.
When adding a custom transformer, the sklearn-onnx docs suggest writing a custom shape function and converter function. The converter function in particular must be built by combining predefined operators from the ONNX standard. However, from what I can tell, it is not possible to do even basic string manipulation with the operators that exist.
One of the regular expression powered replacements that I want to make is a unit conversion, for example:
12m -> 12 meters
With Python's re package this is trivial:
import re
my_string = "The Empire State Building is 443m tall."
meters_pattern = re.compile("(?<=[0-9])m ")
my_transformed_string = re.sub(meters_pattern, " meters ", my_string)
>>> print(my_transformed_string)
The Empire State Building is 443 meters tall.
However, I cannot conceive of a way to do this with the available ONNX operators. Here's what I've thought to try:
Use a regular expression operator in a similar manner to the Python example above.
Problem: ONNX does not have a regex operator.
Evaluate the input string sequentially, one character at a time. If an "m" follows a digit, change the string as described above.
Problem: This approach requires a comparison of strings: does "this character in the string" equal "m"? However, the existing OnnxEqual operator does not support string comparison.
Translate the input string, character by character, to its ASCII decimal equivalent and then perform step 2.
Problem: ONNX does not have a translate-like operator (like GNU tr) for strings. ONNX also does not support casting non-strictly numeric strings with OnnxCast.
Use the OnnxUnique operator and its inverse_indices output to translate the input string to something approximating each character's ASCII decimal value.
Problem: This requires prepending a key string \t\n\r !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~ to the beginning of the input string (so that the numerical values found by OnnxUnique's inverse_indices output have a consistent definition) and splitting the input string into a sequence of single-character tensors. Unfortunately, OnnxSplit errors when trying to split a string tensor (see the code example below), and OnnxSequenceInsert does not append strings into a single-element tensor, only a sequence of single-element tensors into one tensor with multiple elements.
import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from skl2onnx import to_onnx, update_registered_converter
from skl2onnx.common.data_types import StringTensorType
from skl2onnx.algebra.onnx_ops import OnnxSplit, OnnxConstant
from onnxruntime import InferenceSession
class MyTransformer(BaseEstimator, TransformerMixin):
    def fit_transform(self, X, y=None):
        return re.sub("(?<=[0-9])m ", " meters ", X)

def shape_function(operator):
    # declare the input/output tensor types for the custom operator
    input = StringTensorType([1])
    output = StringTensorType([None, 1])
    operator.inputs[0].type = input
    operator.outputs[0].type = output

def converter_function(scope, operator, container):
    op = operator.raw_operator
    opv = container.target_opset
    out = operator.outputs
    X = operator.inputs[0]
    # minimal graph that just tries to split a constant string tensor,
    # which is enough to reproduce the failure described above
    one_tensor = OnnxConstant(value_int=1, op_version=opv)
    string_tensor = OnnxConstant(value_strings=["ab"], op_version=opv)
    string_split_tensor = OnnxSplit(string_tensor, one_tensor, op_version=opv, output_names=out[:1])
    string_split_tensor.add_to(scope, container)

update_registered_converter(MyTransformer, "MyTransformer", shape_function, converter_function)
my_transformer = MyTransformer()
onnx_model = to_onnx(my_transformer, initial_types=[("X", StringTensorType([None, 1]))])
test_string = "The Empire State Building is 443m tall."
sess = InferenceSession(onnx_model.SerializeToString())
output = sess.run(None, {"X": np.array([test_string])})
Yields:
2022-08-16 12:35:46.235861185 [W:onnxruntime:, graph.cc:106 MergeShapeInfo] Error merging shape info for output. 'variable' source:{1} target:{,1}. Falling back to lenient merge.
2022-08-16 12:35:46.237767860 [E:onnxruntime:, inference_session.cc:1530 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/optimizer/optimizer_execution_frame.cc:75 onnxruntime::OptimizerExecutionFrame::Info::Info(const std::vector<const onnxruntime::Node*>&, const InitializedTensorSet&, const onnxruntime::Path&, const onnxruntime::IExecutionProvider&, const std::function<bool(const std::__cxx11::basic_string<char>&)>&) [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : string tensor can not use pre-allocated buffer
How is one to properly manipulate strings with the available ONNX operators?
I asked the ONNX developers this question, and as of August 2022 it simply is not possible to perform regex replacements with ONNX operators. See the full thread here: https://github.com/onnx/onnx/issues/4450
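Given that limitation, one pragmatic workaround (my own sketch, not something from the linked thread) is to keep the regex step in Python, outside the exported graph, and apply it before calling the ONNX session. The model file name below is hypothetical and stands for an export of the pipeline without the regex transformer:
import re
import numpy as np
from onnxruntime import InferenceSession

def preprocess(text):
    # apply the replacement in Python, since ONNX has no regex operator
    return re.sub("(?<=[0-9])m ", " meters ", text)

# hypothetical export of the pipeline's remaining (ONNX-convertible) steps
sess = InferenceSession("pipeline_without_regex.onnx")
raw = "The Empire State Building is 443m tall."
# shape (1, 1) to match a StringTensorType([None, 1]) input
outputs = sess.run(None, {"X": np.array([[preprocess(raw)]])})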
Related
dfff is a DataFrame that has already been tokenized and will be converted to tf-idf using TfidfVectorizer.
(A sample of dfff was shown as an image in the original post.)
Then I create a TfidfVectorizer:
tfidf_vecer2 = TfidfVectorizer(analyzer = 'word', token_pattern=None)
Then I run this code:
tfidf_vectorr= tfidf_vecer2.fit_transform(dfff)
tfidf_array = np.array(tfidf_vectorr.todense())
A TypeError occurred, and I still can't figure it out.
I tried using a list instead of a DataFrame, but it still errors. This is the output:
TypeError: first argument must be string or compiled pattern
Can't see your example dataframe, but let's assume it's something like this:
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'text': ["Though worlds of wanwood leafmeal lie",
                            "And yet you will weep and know why"]})
df['tokenized'] = df['text'].apply(nltk.word_tokenize)
text tokenized
0 Though worlds of wanwood leafmeal lie [Though, worlds, of, wanwood, leafmeal, lie]
1 And yet you will weep and know why [And, yet, you, will, weep, and, know, why]
Then you need a dummy function to use as the tokenizer, in order to leave the input as it is:
def func(x):
    return x

tfidf_vec = TfidfVectorizer(tokenizer=func, analyzer='word',
                            preprocessor=func, token_pattern=None)
tfidf_vec.fit(df['tokenized'])
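As a quick sanity check (a hedged sketch continuing from the code above), transforming the same column should now work without the TypeError:
tfidf_matrix = tfidf_vec.transform(df['tokenized'])
print(tfidf_matrix.shape)       # (2, number_of_distinct_tokens)
print(tfidf_matrix.toarray())   # dense tf-idf array, like the question's todense() call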
I am working on a project that requires me to add a CSV file path in two places in the code. I have seen a somewhat similar problem here on Stack Overflow, but that one was due to an old Python version (2.5); my Python version is 3.8.
import csv
from tensorflow.keras.datasets import mnist
import numpy as np
def load_az_dataset("C:\A_Z_Handwritten_Data\A_Z_Handwritten_Data.csv"):
# initialize the list of data and labels
data = []
labels = []
# loop over the rows of the A-Z handwritten digit dataset
for row in open("C:\A_Z_Handwritten_Data\A_Z_Handwritten_Data.csv"):
# parse the label and image from the row
row = row.split(",")
label = int(row[0])
image = np.array([int(x) for x in row[1:]], dtype="uint8")
# images are represented as single channel (grayscale) images
# that are 28x28=784 pixels -- we need to take this flattened
# 784-d list of numbers and repshape them into a 28x28 matrix
image = image.reshape((28, 28))
# update the list of data and labels
data.append(image)
labels.append(label)
# convert the data and labels to NumPy arrays
data = np.array(data, dtype="float32")
labels = np.array(labels, dtype="int")
# return a 2-tuple of the A-Z data and labels
return (data, labels)
It's showing a syntax error.
The syntax error is caused by the fact that the file path is in the parameter list in the function definition. This is the culprit:
def load_az_dataset("C:\A_Z_Handwritten_Data\A_Z_Handwritten_Data.csv"):
You have no parameters listed in the function definition. You just have a literal string.
Furthermore, you should also either be using raw strings: r"..." or escaping your backslashes, as others have mentioned.
Finally, you should be using the with open(file_path) as f: pattern to open your file.
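Putting those points together, a minimal corrected sketch might look like this (the parameter name is my own choice, and the default path is just the one from the question):
import numpy as np

def load_az_dataset(dataset_path=r"C:\A_Z_Handwritten_Data\A_Z_Handwritten_Data.csv"):
    data, labels = [], []
    # the with-open pattern closes the file automatically
    with open(dataset_path) as f:
        for row in f:
            # each row is: label, then 784 pixel values
            row = row.split(",")
            labels.append(int(row[0]))
            data.append(np.array([int(x) for x in row[1:]], dtype="uint8").reshape((28, 28)))
    return np.array(data, dtype="float32"), np.array(labels, dtype="int")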
The syntax error is caused because you are passing a literal string in the parameter list of load_az_dataset.
You need to define the parameter to the function as:
def load_az_dataset(fileName):
Further, if you want to add that file as the default value for the parameter then use:
def load_az_dataset(fileName="C:\\A_Z_Handwritten_Data\\A_Z_Handwritten_Data.csv"):
Also, unrelated to the problem, you need to escape the \ with another \.
Try:
open("C:\\A_Z_Handwritten_Data\\A_Z_Handwritten_Data.csv")
I'm trying to do some fuzzy matching on some OCR results, and I want to be able to factor in common OCR errors. In particular, I'm matching streets to a database of streets. I figured out how to down-weight common single-character substitution errors using the weighted-levenshtein package, but it seems to only work on single characters, when many of the most common errors are things like "li" to "h".
Right now, "Mam" matches most closely to "MAY ST," when I'd really like it to match to "MAIN ST" instead. I'd like to be able to build something in that knows that "IN" and "M" often correspond because "in" gets read as "m" by the OCR.
Here's the current code I'm working with (I'm down-weighting inserts because some of the streets I'm working with are abbreviated or missing "St", "Ave", etc.):
import numpy as np
from weighted_levenshtein import lev, osa, dam_lev
def lratio(str1, str2):
    insert_costs = 0.5 * np.ones(128, dtype=np.float64)
    delete_costs = np.ones(128, dtype=np.float64)
    substitute_costs = np.ones((128, 128), dtype=np.float64)
    # common OCR confusions get a reduced substitution cost
    substitute_costs[ord('O'), ord('0')] = 0.25
    substitute_costs[ord('0'), ord('O')] = 0.25
    substitute_costs[ord('I'), ord('T')] = 0.5
    substitute_costs[ord('T'), ord('I')] = 0.5
    ldistance = lev(str1, str2, insert_costs=insert_costs, delete_costs=delete_costs,
                    substitute_costs=substitute_costs)
    return (1.0 - float(ldistance) / float(len(str1) + len(str2))) * 100.0
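For reference, a small usage sketch of the function above (the candidate street names here are made up):
candidates = ["MAIN ST", "MAY ST", "MAPLE AVE"]
query = "MAM"
# pick the candidate street with the highest similarity score
best = max(candidates, key=lambda s: lratio(query, s))
print(best)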
I don't think there's a way to modify weighted-levenshtein for multi-character substitution. But if there were, it would be great. And I bet there's a package out there that has this capacity--possibly a package that has a library of common errors already built in.
Any ideas?
I have obtained a dictionary mapping words to their vectors in Python, and I am trying to scatter plot the n most similar words, since running t-SNE on a huge number of words takes forever. The best option is to convert the dictionary to a word2vec object to work with it.
I had the same issue and I finally found the solution.
So, I assume that your dictionary looks like mine:
import numpy as np

d = {}
d['1'] = np.random.randn(300)
d['2'] = np.random.randn(300)
Basically, the keys are the users' ids and each of them has a vector with shape (300,).
So now, in order to use it as word2vec, I first need to save it to a binary file and then load it with the gensim library:
import numpy as np
import gensim
from numpy import zeros, dtype, float32 as REAL, ascontiguousarray, fromstring
from gensim import utils

m = gensim.models.keyedvectors.Word2VecKeyedVectors(vector_size=300)
m.vocab = d
m.vectors = np.array(list(d.values()))
my_save_word2vec_format(binary=True, fname='train.bin', total_vec=len(d), vocab=m.vocab, vectors=m.vectors)
Where my_save_word2vec_format function is:
def my_save_word2vec_format(fname, vocab, vectors, binary=True, total_vec=2):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec-tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words.
    vectors : numpy.array
        The vectors to be stored.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
    total_vec : int, optional
        Explicitly specify total number of vectors
        (in case word vectors are appended with document vectors afterwards).
    """
    if not (vocab or vectors):
        raise RuntimeError("no input")
    if total_vec is None:
        total_vec = len(vocab)
    vector_size = vectors.shape[1]
    assert (len(vocab), vector_size) == vectors.shape
    with utils.smart_open(fname, 'wb') as fout:
        print(total_vec, vector_size)
        fout.write(utils.to_utf8("%s %s\n" % (total_vec, vector_size)))
        # store in sorted order: most frequent words at the top
        for word, row in vocab.items():
            if binary:
                row = row.astype(REAL)
                fout.write(utils.to_utf8(word) + b" " + row.tostring())
            else:
                fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))
And then use:
m2 = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format('train.bin', binary=True)
to load the model as word2vec.
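As a quick hedged check after loading (the key '1' comes from the toy dictionary above):
# look up a stored vector and query nearest neighbours
vec = m2['1']
print(vec.shape)                     # (300,)
print(m2.most_similar('1', topn=1))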
If you've calculated the word-vectors with your own code, you may want to write them to a file in a format compatible with Google's original word2vec.c or gensim. You can review the gensim code in KeyedVectors.save_word2vec_format() to see exactly how its vectors are written – it's less than 20 lines of code – and do something similar to your vectors. See:
https://github.com/RaRe-Technologies/gensim/blob/3d2227d58b10d0493006a3d7e63b98d64e991e60/gensim/models/keyedvectors.py#L130
Then you could re-load vectors that originated with your code and use them almost directly with examples like the one from Jeff Delaney you mention.
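As a rough illustration of that file format (my own sketch, not the gensim implementation), the plain-text variant is just a header line with the vector count and dimensionality, followed by one word and its values per line:
import numpy as np

def write_word2vec_text(path, word_vectors):
    # word_vectors: dict mapping word -> 1-D numpy array, all of the same length
    dim = len(next(iter(word_vectors.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write("%d %d\n" % (len(word_vectors), dim))
        for word, vec in word_vectors.items():
            f.write(word + " " + " ".join(repr(float(v)) for v in vec) + "\n")

write_word2vec_text("my_vectors.txt", {"hello": np.random.randn(4), "world": np.random.randn(4)})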
I wrote a simple procedure to calculate the average test coverage of some specific packages in a Java project. The raw data comes from a huge HTML file and looks like this:
<body>
package pkg1 <line_coverage>11/111,<branch_coverage>44/444<end>
package pkg2 <line_coverage>22/222,<branch_coverage>55/555<end>
package pkg3 <line_coverage>33/333,<branch_coverage>66/666<end>
...
</body>
Given the specified packages "pkg1" and "pkg3", for example, the average line coverage is:
(11+33)/(111+333)
and average branch coverage is:
(44+66)/(444+666)
I wrote the following procedure to get the result, and it works well. But how can I implement this calculation in a functional style? Something like "(x,y) for x in ... for b in ... if...". I know a little Erlang, Haskell and Clojure, so solutions in those languages are also appreciated. Thanks a lot!
from __future__ import division
import re
datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for line in datafile:
    for pkg in core_pkgs:
        ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
        match = ptn.match(line)
        if match is not None:
            cvln, tlln, cvbh, tlbh = match.groups()
            covered_lines += int(cvln)
            total_lines += int(tlln)
            covered_branches += int(cvbh)
            total_branches += int(tlbh)

print 'Line coverage:', '{:.2%}'.format(covered_lines / total_lines)
print 'Branch coverage:', '{:.2%}'.format(covered_branches / total_branches)
Down below you can find my Haskell solution. I will try to explain the important points I went through as I wrote it.
First you will find that I created a data structure for coverage data. It's generally a good idea to create data structures to represent whatever data you want to handle. This is in part because it makes it easier to design your code when you can think in terms of whatever you are designing – closely related to functional programming philosophies, and in part because it can eliminate a few bugs where you think you are doing something but are in actuality doing something else.
Related to the point before: The first thing I do is to convert the string-represented data into my own data structure. When you are doing functional programming, you are often doing things in "sweeps." You don't have a single function that converts data to your format, filters out the unwanted data and summarises the result. You have three different functions for each of those tasks, and you do them one at a time!
This is because functions are very composable, i.e. if you have three different ones, you can stick them together to form a single one if you want to. If you start with a single one, it is very difficult to take it apart to form three different ones.
The inner workings of the conversion function are quite uninteresting unless you are specifically doing Haskell. All it does is try to match each string with a regex, and if it succeeds, it adds the coverage data to the resulting list.
Again, mad composition is about to happen. I don't create a function to loop over a list of coverages and sum them up. I create a single function to sum two coverages, because I know I can use it together with the specialised fold loop (which is sort of like a for loop on steroids) to summarise all coverages in a list. There's no need for me to reinvent the wheel and create a loop myself.
Besides, my sumCoverages function works with a lot of specialised loops, so I don't have to write a ton of functions, I just stick my single function into a ton of pre-made library functions!
In the main function you will see what I mean by programming in "sweeps" or "passes" over the data. First I convert it to the internal format, then I filter out the unwanted data, then I summarise the remaining data. These are completely independent computations. That's functional programming.
You will also notice that I use two specialised loops there, filter and fold. This means that I don't have to write any loops myself, I just stick in a function to those standard library loops and let those take it from there.
import Data.Maybe (catMaybes)
import Data.List (foldl')
import Text.Printf (printf)
import Text.Regex (matchRegex, mkRegex)

corePkgs = ["d", "f"]

stats = [
  "d>11/23d>34/89d",
  "e>25/65e>13/25e",
  "f>36/92f>19/76"
  ]

format = mkRegex ".*(\\w+).*>([0-9]+)/([0-9]+).*>([0-9]+)/([0-9]+).*"

-- It might be a good idea to define a datatype for coverage data.
-- A bit of coverage data is defined as the name of the package it
-- came from, the lines covered, the total amount of lines, the
-- branches covered and the total amount of branches.
data Coverage = Coverage String Int Int Int Int

-- Then we need a way to convert the string data into a list of
-- coverage data. We do this by regex. We try to match on each
-- string in the list, and then we choose to keep only the successful
-- matches. Returned is a list of coverage data that was represented
-- by the strings.
convert :: [String] -> [Coverage]
convert = catMaybes . map match
  where match line = do
          [name, cl, tl, cb, tb] <- matchRegex format line
          return $ Coverage name (read cl) (read tl) (read cb) (read tb)

-- We need a way to summarise two coverage data bits. This can of course also
-- be used to summarise entire lists of coverage data, by folding over it.
sumCoverage (Coverage nameA clA tlA cbA tbA) (Coverage nameB clB tlB cbB tbB) =
  Coverage (nameA ++ nameB ++ ",") (clA + clB) (tlA + tlB) (cbA + cbB) (tbA + tbB)

main = do
  -- First we need to convert the strings to coverage data
  let coverageData = convert stats
      -- Then we want to filter out only the relevant data
      relevantData = filter (\(Coverage name _ _ _ _) -> name `elem` corePkgs) coverageData
      -- Then we need to summarise it, but we are only interested in the numbers
      Coverage _ cl tl cb tb = foldl' sumCoverage (Coverage "" 0 0 0 0) relevantData
  -- So we can finally print them!
  printf "Line coverage: %.2f\n" (fromIntegral cl / fromIntegral tl :: Double)
  printf "Branch coverage: %.2f\n" (fromIntegral cb / fromIntegral tb :: Double)
Here are some quickly-hacked, untested ideas applied to your code:
import numpy as np
import re
datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for pkg in core_pkgs:
    ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
    matches = map(ptn.match, datafile)
    statsList = [map(int, match.groups()) for match in matches if match]
    # statsList is a list of [cvln, tlln, cvbh, tlbh]
    stats = np.array(statsList)
    covered_lines, total_lines, covered_branches, total_branches = stats.sum(axis=0)
Well, as you can see I haven't bothered to finish off the remaining loop, but I think the point is made by now. There's certainly a lot more than one way to do this; I elected to show off map() (which some will say makes this less efficient, and it probably does), as well as NumPy to get the (admittedly light) math done.
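For what it's worth, here is a hedged sketch of one way to finish that idea off, accumulating over both packages (written in Python 3 style, and just as informal as the snippet above):
import re
import numpy as np

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')

def pkg_stats(pkg):
    # collect [cvln, tlln, cvbh, tlbh] for every line that matches this package
    ptn = re.compile('.*' + pkg + r'.*>(\d+)/(\d+).*>(\d+)/(\d+).*')
    return [list(map(int, m.groups())) for m in map(ptn.match, datafile) if m]

# stack the per-package rows and sum each column
stats = np.array([row for pkg in core_pkgs for row in pkg_stats(pkg)])
covered_lines, total_lines, covered_branches, total_branches = stats.sum(axis=0)
print('Line coverage: {:.2%}'.format(covered_lines / total_lines))
print('Branch coverage: {:.2%}'.format(covered_branches / total_branches))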
This is the corresponding Clojure solution:
(defn extract-data
  "extract 4 integers from a string line according to a package name"
  [pkg line]
  (map read-string
       (rest (first
              (re-seq
               (re-pattern
                (str pkg ".*>(\\d+)/(\\d+).*>(\\d+)/(\\d+)"))
               line)))))

(defn scan-lines-by-pkg
  "scan all string lines and extract all data as integer sequences
   according to package names"
  [pkgs lines]
  (filter seq (for [pkg pkgs
                    line lines]
                (extract-data pkg line))))

(defn sum-data
  "add all data in valid lines together"
  [pkgs lines]
  (apply map + (scan-lines-by-pkg pkgs lines)))

(defn get-percent
  [covered all]
  (str (format "%.2f" (float (/ (* covered 100) all))) "%"))

(defn get-cov
  [pkgs lines]
  {:line-cov (apply get-percent (take 2 (sum-data pkgs lines)))
   :branch-cov (apply get-percent (drop 2 (sum-data pkgs lines)))})

(get-cov ["d" "f"] ["abc" "d>11/23d>34/89d" "e>25/65e>13/25e" "f>36/92f>19/76"])