How can I convert a regex to an NFA? - python

Are there any modules available in Python to convert a regular expression to corresponding NFA,
or do I have to build the code from scratch (by converting the regex from infix to postfix and then implementing Thompson's Algorithm to get the corresponding NFA)?
Is it possible in Python to get the state diagram of an NFA from the transition table?

keys=list(set(re.sub('[^A-Za-z0-9]+', '', regex)+'e'))
for i in regex:
if i in keys:
elif i=='*':
if start==r1:start=c1
if end==r2:end=c2
elif i=='.':
if start==r11:start=r21
if end==r22:end=r12
s[c1]['e']=(r21,r11); s[r12]['e']=c2; s[r22]['e']=c2
if start==r11 or start==r21:start=c1
if end==r22 or end==r12:end=c2
print keys
print s
This is pretty much the code that runs after the postfix conversion. s contains the transition table and keys contains all the terminals used, including e, which stands for epsilon.
It's completely based on Thompson's Algorithm.
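For reference, here is a minimal, self-contained sketch of Thompson's construction working from a postfix regex. It is not the code from the question: the function names (thompson, eps_closure, accepts), the "ε" label and the choice of '.', '|' and '*' as the only operators are assumptions made for illustration, so those three characters cannot be used as literals here.

```python
from itertools import count

EPS = "ε"  # epsilon label (assumption: chosen so it cannot clash with a literal 'e')


def thompson(postfix):
    """Build an NFA from a postfix regex with Thompson's construction.

    Operators: '.' (concatenation), '|' (union), '*' (Kleene star).
    Returns (start, accept, table) where table maps
    state -> symbol -> set of next states.
    """
    ids = count()
    table = {}

    def new_state():
        state = next(ids)
        table[state] = {}
        return state

    def link(src, symbol, dst):
        table[src].setdefault(symbol, set()).add(dst)

    stack = []  # NFA fragments stored as (start, accept) pairs
    for ch in postfix:
        if ch == ".":
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            link(a1, EPS, s2)  # accept of the first feeds the second
            stack.append((s1, a2))
        elif ch == "|":
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            s, a = new_state(), new_state()
            link(s, EPS, s1)
            link(s, EPS, s2)
            link(a1, EPS, a)
            link(a2, EPS, a)
            stack.append((s, a))
        elif ch == "*":
            s1, a1 = stack.pop()
            s, a = new_state(), new_state()
            link(s, EPS, s1)   # enter the loop
            link(s, EPS, a)    # or skip it entirely
            link(a1, EPS, s1)  # repeat
            link(a1, EPS, a)   # or leave
            stack.append((s, a))
        else:  # any other character is a literal symbol
            s, a = new_state(), new_state()
            link(s, ch, a)
            stack.append((s, a))
    start, accept = stack.pop()
    return start, accept, table


def eps_closure(states, table):
    """All states reachable from `states` using only epsilon moves."""
    todo, seen = list(states), set(states)
    while todo:
        for nxt in table[todo.pop()].get(EPS, ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def accepts(nfa, text):
    """Simulate the NFA on `text` by tracking the set of active states."""
    start, accept, table = nfa
    current = eps_closure({start}, table)
    for ch in text:
        moved = {n for s in current for n in table[s].get(ch, ())}
        current = eps_closure(moved, table)
    return accept in current


nfa = thompson("ab|*c.")  # postfix form of (a|b)*c
print(accepts(nfa, "abbac"), accepts(nfa, "ab"))
```

The returned table has the same shape as a transition table, so it could be handed to a graph-drawing tool to produce a state diagram.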


Python Latex Library

I often work with groups of materials and my file/materials are named as alphanumeric strings. Is there a library to turn a string like r"Mxene - Ti3C2" to latex styled r"Mxene - Ti$_\mathrm{3}$C$_\mathrm{2}$"?
I usually use a dictionary but going through every name is a hassle and prone to error because materials can always be added or removed from the study.
I know that I can use str.maketrans() to generate subscripts but I haven't had very consistent results using the output with matplotlib so I'd much rather use latex.
I've ultimately created this solution in case anyone else needs it. Since my problem is mostly to create subscripts, the following code will look for numbers and replace them with a latex equivalent to create one.
def latexify(s):
    import re
    nums = re.findall(r'\d+', s)
    pos = [[m.start(0), m.end(0)] for m in re.finditer(r'\d+', s)]
    numpos = list(zip(nums, pos))
    for num, pos in numpos:
        # raw f-string, so that \m is not read as an escape sequence
        string = rf"$_\mathrm{{{num}}}$"
        s = s[:pos[0]] + string + s[pos[1]:]
        # shift the stored positions of the numbers that come later
        for ind, (n, [p_st, p_end]) in enumerate(numpos):
            if p_st > pos[1]:
                numpos[ind][1][0] += len(string)-len(num)
                numpos[ind][1][1] += len(string)-len(num)
    return s
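A shorter variant of the same idea, as a sketch: re.sub with a function as the replacement handles every run of digits in one pass, so no position bookkeeping is needed. The name latexify_subscripts is made up for this example.

```python
import re

_DIGITS = re.compile(r"\d+")


def latexify_subscripts(name):
    """Wrap every run of digits in a LaTeX roman subscript."""
    # A callable replacement is inserted literally, so the backslash in
    # \mathrm is not interpreted as a regex replacement escape.
    return _DIGITS.sub(lambda m: rf"$_\mathrm{{{m.group(0)}}}$", name)


print(latexify_subscripts("Mxene - Ti3C2"))
```

Like the original, this treats every digit run as a subscript, so a name containing a number that is not a stoichiometric index would need special handling.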

ONNX Operators for Regex Replacements

QUESTION: How can I use ONNX operators to do string replacements with regular expressions?
I am trying to export a Scikit-Learn machine learning pipeline to the Open Neural Network Exchange (ONNX) format. The pipeline takes text as input. Many of the steps that are included in the pipeline are nicely included in the standard, like a TfIdfVectorizer and a TruncatedSVD transformer. However, the first pipeline step is a custom transformer which makes a set of changes to the input text through the exploitation of regular expressions.
When adding a custom transformer, the scikitlearn-onnx docs suggest that a custom shape and converter function should be written. The converter function in particular must be written by combining a set of predefined operators that exist within the ONNX standard. However, from what I can tell, it is not possible to do even basic string manipulation with the operators that exist.
One of the regular expression powered replacements that I want to make is a unit conversion, for example:
12m -> 12 meters
With Python's re package this is trivial:
import re
my_string = "The Empire State Building is 443m tall."
meters_pattern = re.compile("(?<=[0-9])m ")
my_transformed_string = re.sub(meters_pattern, " meters ", my_string)
>>> print(my_transformed_string)
The Empire State Building is 443 meters tall.
However, I cannot conceive of a way to do this with the available ONNX operators. Here's what I've thought to try:
1. Use a regular expression operator in a similar manner to the Python example above.
Problem: ONNX does not have a regex operator.
2. Evaluate the input string sequentially, one character at a time. If an "m" follows a digit, change the string as described above.
Problem: This approach requires a comparison of strings: does "this character in the string" equal "m"? However, the existing OnnxEqual operator does not support string comparison.
3. Translate the input string, character by character, to its ASCII decimal equivalent and then perform step 2.
Problem: ONNX does not have a translate-like operator (like GNU tr) for strings. ONNX also does not support casting non-strictly numeric strings with OnnxCast.
4. Use the OnnxUnique operator and its inverse_indices property to translate the input string to something approximating each character's ASCII decimal value.
Problem: This requires prepending a key string \t\n\r !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_``abcdefghijklmnopqrstuvwxyz{|}~ to the beginning of the input string (so that the numerical values found by OnnxUnique's inverse_indices property have a consistent definition) and splitting the input string into a sequence of tensors of one character each. Unfortunately, OnnxSplit errors when trying to split a string tensor (see code example below), and OnnxSequenceInsert does not append strings into a single element tensor, just a sequence of single element tensors into a single tensor with multiple elements.
import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from skl2onnx import to_onnx, update_registered_converter
from skl2onnx.common.data_types import StringTensorType
from skl2onnx.algebra.onnx_ops import OnnxSplit, OnnxConstant
from onnxruntime import InferenceSession

class MyTransformer(BaseEstimator, TransformerMixin):
    def fit_transform(self, X, y=None):
        return re.sub("(?<=[0-9])m ", " meters ", X)

def shape_function(operator):
    input = StringTensorType([1])
    output = StringTensorType([None, 1])
    operator.inputs[0].type = input
    operator.outputs[0].type = output

def converter_function(scope, operator, container):
    op = operator.raw_operator
    opv = container.target_opset
    out = operator.outputs
    X = operator.inputs[0]
    one_tensor = OnnxConstant(value_int=1, op_version=opv)
    string_tensor = OnnxConstant(value_strings=["ab"], op_version=opv)
    # attempt to split the constant string tensor; this is the step that fails
    string_split_tensor = OnnxSplit(string_tensor, one_tensor, op_version=opv, output_names=out[:1])
    string_split_tensor.add_to(scope, container)

update_registered_converter(MyTransformer, "MyTransformer", shape_function, converter_function)
my_transformer = MyTransformer()
onnx_model = to_onnx(my_transformer, initial_types=[["X", StringTensorType([None, 1])]])
test_string = "The Empire State Building is 443m tall."
sess = InferenceSession(onnx_model.SerializeToString())
output = sess.run(None, {"X": np.array([test_string])})
2022-08-16 12:35:46.235861185 [W:onnxruntime:, MergeShapeInfo] Error merging shape info for output. 'variable' source:{1} target:{,1}. Falling back to lenient
2022-08-16 12:35:46.237767860 [E:onnxruntime:, operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/optimizer/ onnxruntime::OptimizerExecutionFrame::Info::Info(const std::vector<const onnxruntime::Node*>&, const InitializedTensorSet&, const onnxruntime::Path&,
const onnxruntime::IExecutionProvider&, const std::function<bool(const std::__cxx11::basic_string<char>&)>&) [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : string tensor can not use pre-allocated buffer
How is one to properly manipulate strings with the available ONNX operators?
I asked the ONNX developers this question, and as of August 2022, it simply is not possible to perform REGEX replacements with ONNX operators. See the full thread here:
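Given that limitation, one possible workaround is to keep the regex step in plain Python and apply it to the text before it reaches the ONNX session, exporting only the remaining pipeline steps. This is an assumption on my part and not something confirmed in the thread. A minimal sketch, with made-up function names (normalize_units, preprocess_batch):

```python
import re

# Same pattern as in the question: an "m " directly preceded by a digit.
METERS = re.compile(r"(?<=[0-9])m ")


def normalize_units(text):
    """Expand the unit abbreviation before the text is passed to the model."""
    return METERS.sub(" meters ", text)


def preprocess_batch(texts):
    """Apply the regex step to every document in a batch."""
    return [normalize_units(t) for t in texts]


batch = preprocess_batch(["The Empire State Building is 443m tall."])
print(batch[0])
```

The cleaned strings would then be the input to the exported model. The cost is that the preprocessing has to be reimplemented in whatever language hosts the ONNX runtime.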

Python library to automatically detect unknown similar strings

I have a very big file with millions of paths to various executables on windows systems. A simple example would be the following:
As a human, I can recognize that these strings are similar and match them fairly easily with some regex in code. My issue, however, is finding these patterns in the first place: there are far too many of them, they are completely unknown to me, and they change frequently.
My goal is to write a python script that finds these similar strings with a degree of certainty and groups them for me.
Which methods, libraries, keywords etc. should I look into to solve this problem?
One possible way is to approach this by calculating the distance between strings. For that, you could use the textdistance lib.
Hope this helps!
Two starting points to get more familiarized with the subject:
Try fuzzywuzzy, a soft string matcher. It makes a difference if you keep the strings as they are or lower case them first:
from fuzzywuzzy import fuzz
import itertools
lines = [
for line1, line2 in itertools.combinations(lines, r=2):
    case_match = fuzz.ratio(line1, line2)
    insensitive_case_match = fuzz.ratio(line1.lower(), line2.lower())
    print(line1[:10], '...', line1[:-10])
    print(line2[:10], '...', line2[:-10])
    print(case_match, insensitive_case_match)
C:\windows ... C:\windows\ccmcached\Deploy-Appli
C:\WINDOWS ... C:\WINDOWS\ccmcache\m\Deploy-Appli
80 95
C:\windows ... C:\windows\ccmcached\Deploy-Appli
user5323\A ... user5323\A-different-Appli
42 45
C:\WINDOWS ... C:\WINDOWS\ccmcache\m\Deploy-Appli
user5323\A ... user5323\A-different-Appli
40 45
One fairly straightforward way is to check how much a pair of strings differ, like so:
import difflib
from collections import defaultdict
grouping_requirement = 0.75 # (0;1), the closer to 1, the stronger the equality needs to be to be grouped
s = r'''C:\windows\ccmcache\1d\Deploy-Application.exe
groups = defaultdict(list)
def match_ratio(s1,s2):
return difflib.SequenceMatcher(None,s1,s2).ratio()
for line in set(s.splitlines()):
for group in groups:
if match_ratio(group, line) > grouping_requirement:
for group in groups.values():
print(', '.join(group))
The output of this little application is:
C:\WINDOWS\ccmcache\1g\Deploy-Application.exe, C:\WINDOWS\ccmcache\m\Deploy-Application.exe, C:\windows\ccmcache\1l\Deploy-Application.exe, C:\WINDOWS\ccmcache\15\Deploy-Application.exe, C:\WINDOWS\ccmcache\7\Deploy-Application.exe, C:\WINDOWS\ccmcache\6\Deploy-Application.exe, C:\windows\ccmcache\2s\Deploy-Application.exe, C:\windows\ccmcache\1d\Deploy-Application.exe, C:\windows\ccmcache\2o\Deploy-Application.exe, C:\windows\ccmcache\2r\Deploy-Application.exe
C:\Users\user23452345\temp\test\1\Another1-Application.exe, C:\Users\user23hkjhf_5\temp\An0ther-Application.exe, C:\Users\user1324asdf\temp\Another-Applicatiooon.exe, C:\Users\user23452---5\temp\lili\Another-Application.exe
At the top of the code snippet there is a constant, grouping_requirement, which I arbitrarily set to 0.75. If you reduce that value toward 0.0, more paths will be grouped together; if you raise it toward 1.0, fewer paths will be grouped. Good luck!
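The grouping idea can be sketched as a small self-contained function. This is an illustration rather than the exact code from the answer: the function name group_similar and the choice to compare each new line only against the first member of every group are assumptions. The sample paths are taken from the output shown above.

```python
import difflib


def match_ratio(s1, s2):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return difflib.SequenceMatcher(None, s1, s2).ratio()


def group_similar(lines, threshold=0.75):
    """Greedy grouping: a line joins the first group whose representative
    (the group's first member) is similar enough, else starts a new group."""
    groups = {}  # representative -> list of members
    for line in lines:
        for representative in groups:
            if match_ratio(representative, line) > threshold:
                groups[representative].append(line)
                break
        else:  # no existing group matched
            groups[line] = [line]
    return list(groups.values())


paths = [
    r"C:\WINDOWS\ccmcache\1g\Deploy-Application.exe",
    r"C:\Users\user1324asdf\temp\Another-Applicatiooon.exe",
    r"C:\WINDOWS\ccmcache\7\Deploy-Application.exe",
    r"C:\Users\user23hkjhf_5\temp\An0ther-Application.exe",
]
for group in group_similar(paths):
    print(", ".join(group))
```

Every line is compared against every existing group, so the cost grows with the number of groups. For millions of paths it would likely help to deduplicate or lower-case the input first.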

How to solve Unicoder problem when reading csv

I am totally new to Python. I am using a package called pyConTextNLP that takes medical text and annotates it with classifiers.
It basically takes some natural language text, adds some 'modifiers' to it and classifies it whilst removing negative findings.
The problem I am having is how to add the list of modifiers as a csv or a yaml file. I have been following the basic setup instructions here:
The problem is the line here:
modifiers = itemData.get_items("")
itemData.get_items doesn't look like it exists anymore and there is a function instead called itemData.get_fileobj(). This takes a csv file as far as I understand and the csv is passed to the function markup.markItems(modifiers, mode="modifier") which looks at the text and 'marks up' any concepts in the raw text that match the modifiers.
The error that I get when trying to run the example code is:
if not item.getLiteral() in compiledRegExprs:
and this gives me the error:
AttributeError: 'UnicodeReader' object has no attribute 'getLiteral'
The whole code is here: but I have also written it below
import networkx as nx
import pyConTextNLP.itemData as itemData
import pyConTextNLP.pyConTextGraph as pyConText
reports = [
"""IMPRESSION: Evaluation limited by lack of IV contrast; however, no evidence of
bowel obstruction or mass identified within the abdomen or pelvis. Non-specific interstitial opacities and bronchiectasis seen at the right
base, suggestive of post-inflammatory changes.""",
MICRO These biopsies of large bowel mucosa show oedema of the lamina propriabut no architectural abnormality
There is no dysplasia or malignancy
There is no evidence of active inflammation
There is no increase in the inflammatory cell content of the lamina propria""" ,
1. 2.0 cm cyst of the right renal lower pole. Otherwise, normal appearance
of the right kidney with patent vasculature and no sonographic evidence of
renal artery stenosis.
2. Surgically absent left kidney.""",
"""IMPRESSION: No definite pneumothorax""",
"""IMPRESSION: New opacity at the left lower lobe consistent with pneumonia."""
modifiers = itemData.get_fileobj("/Applications/anaconda3/lib/python3.7/site-packages/pyConTextNLP-")
targets = itemData.get_fileobj("/Applications/anaconda3/lib/python3.7/site-packages/pyConTextNLP-")
def markup_sentence(s, modifiers, targets, prune_inactive=True):
markup = pyConText.ConTextMarkup()
markup.markItems(modifiers, mode="modifier")
markup.markItems(targets, mode="target")
# apply modifiers to any targets within the modifiers scope
if prune_inactive:
return markup
markup = pyConText.ConTextMarkup()
markup.markItems(modifiers, mode="modifier")
markup.markItems(targets, mode="target")
for node in markup.nodes(data=True):
for node in markup.nodes(data=True):
for edge in markup.edges():
markItems function is here:
def markItems(self, items, mode="target"):
"""tags the sentence for a list of items
items: a list of contextItems"""
if not items:
for item in items:
self.add_nodes_from(self.markItem(item, ConTextMode=mode),
The question is, how can I get the code to read the list in the csv file without throwing this error?

Converting an imperative algorithm into functional style

I wrote a simple procedure to calculate the average of the test coverage of some specific packages in a Java project. The raw data in a huge html file is like this:
package pkg1 <line_coverage>11/111,<branch_coverage>44/444<end>
package pkg2 <line_coverage>22/222,<branch_coverage>55/555<end>
package pkg3 <line_coverage>33/333,<branch_coverage>66/666<end>
Given the specified packages "pkg1" and "pkg3", for example, the average line coverage is:
(11 + 33) / (111 + 333)
and the average branch coverage is:
(44 + 66) / (444 + 666)
I wrote the following procedure to get the result and it works well. But how can I implement this calculation in a functional style? Something like "(x,y) for x in ... for b in ... if...". I know a little Erlang, Haskell and Clojure, so solutions in these languages are also appreciated. Thanks a lot!
from __future__ import division
import re
datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for line in datafile:
    for pkg in core_pkgs:
        ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
        match = ptn.match(line)
        if match is not None:
            cvln, tlln, cvbh, tlbh = match.groups()
            covered_lines += int(cvln)
            total_lines += int(tlln)
            covered_branches += int(cvbh)
            total_branches += int(tlbh)
print 'Line coverage:', '{:.2%}'.format(covered_lines / total_lines)
print 'Branch coverage:', '{:.2%}'.format(covered_branches/total_branches)
Down below you can find my Haskell solution. I will try to explain the important points I went through as I wrote it.
First you will find that I created a data structure for coverage data. It's generally a good idea to create data structures to represent whatever data you want to handle. This is in part because it makes it easier to design your code when you can think in terms of whatever you are designing – closely related to functional programming philosophies, and in part because it can eliminate a few bugs where you think you are doing something but are in actuality doing something else.
Related to the previous point: the first thing I do is convert the string-represented data into my own data structure. When you are doing functional programming, you often do things in "sweeps." You don't have a single function that converts data to your format, filters out the unwanted data and summarises the result. You have three different functions, one for each of those tasks, and you apply them one at a time!
This is because functions are very composable, i.e. if you have three different ones, you can stick them together to form a single one if you want to. If you start with a single one, it is very difficult to take it apart to form three different ones.
The actual workings of the conversion function are quite uninteresting unless you are specifically doing Haskell. All it does is try to match each string against a regex, and if the match succeeds, it adds the coverage data to the resulting list.
Again, mad composition is about to happen. I don't create a function to loop over a list of coverages and sum them up. I create a single function to sum two coverages, because I know I can use it together with the specialised fold loop (which is sort of like a for loop on steroids) to summarise all coverages in a list. There's no need for me to reinvent the wheel and create a loop myself.
Besides, my sumCoverages function works with a lot of specialised loops, so I don't have to write a ton of functions, I just stick my single function into a ton of pre-made library functions!
In the main function you will see what I mean by programming in "sweeps" or "passes" over the data. First I convert it to the internal format, then I filter out the unwanted data, then I summarise the remaining data. These are completely independent computations. That's functional programming.
You will also notice that I use two specialised loops there, filter and fold. This means that I don't have to write any loops myself, I just stick in a function to those standard library loops and let those take it from there.
import Data.Maybe (catMaybes)
import Data.List (foldl')
import Text.Printf (printf)
import Text.Regex (matchRegex, mkRegex)
corePkgs = ["d", "f"]
stats = [
format = mkRegex ".*(\\w+).*>([0-9]+)/([0-9]+).*>([0-9]+)/([0-9]+).*"
-- It might be a good idea to define a datatype for coverage data.
-- A bit of coverage data is defined as the name of the package it
-- came from, the lines covered, the total amount of lines, the
-- branches covered and the total amount of branches.
data Coverage = Coverage String Int Int Int Int
-- Then we need a way to convert the string data into a list of
-- coverage data. We do this by regex. We try to match on each
-- string in the list, and then we choose to keep only the successful
-- matches. Returned is a list of coverage data that was represented
-- by the strings.
convert :: [String] -> [Coverage]
convert = catMaybes . map match
  where match line = do
          [name, cl, tl, cb, tb] <- matchRegex format line
          return $ Coverage name (read cl) (read tl) (read cb) (read tb)
-- We need a way to summarise two coverage data bits. This can of course also
-- be used to summarise entire lists of coverage data, by folding over it.
sumCoverage (Coverage nameA clA tlA cbA tbA) (Coverage nameB clB tlB cbB tbB) =
  Coverage (nameA ++ nameB ++ ",") (clA + clB) (tlA + tlB) (cbA + cbB) (tbA + tbB)
main = do
  -- First we need to convert the strings to coverage data
  let coverageData = convert stats
      -- Then we want to filter out only the relevant data
      relevantData = filter (\(Coverage name _ _ _ _) -> name `elem` corePkgs) coverageData
      -- Then we need to summarise it, but we are only interested in the numbers
      Coverage _ cl tl cb tb = foldl' sumCoverage (Coverage "" 0 0 0 0) relevantData
  -- So we can finally print them!
  printf "Line coverage: %.2f\n" (fromIntegral cl / fromIntegral tl :: Double)
  printf "Branch coverage: %.2f\n" (fromIntegral cb / fromIntegral tb :: Double)
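The same three passes (convert, filter, fold) can also be sketched in Python, the language of the question. This is a rough translation under my own assumptions: the function names parse, select and total are made up, and the regex expects the package name at the very start of each line, as in the sample data.

```python
import re
from functools import reduce

# name>covered/total ... >covered/total
LINE = re.compile(r"(\w+?)>(\d+)/(\d+).*>(\d+)/(\d+)")


def parse(lines):
    """Pass 1: turn matching lines into (package, [cl, tl, cb, tb]) records."""
    matches = (LINE.match(line) for line in lines)
    return [(m.group(1), [int(g) for g in m.groups()[1:]]) for m in matches if m]


def select(records, packages):
    """Pass 2: keep only the counts of the requested packages."""
    return [counts for name, counts in records if name in packages]


def total(counts_list):
    """Pass 3: fold all counts into one element-wise sum."""
    add = lambda acc, counts: [a + b for a, b in zip(acc, counts)]
    return reduce(add, counts_list, [0, 0, 0, 0])


datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')

cl, tl, cb, tb = total(select(parse(datafile), core_pkgs))
print('Line coverage: {:.2%}'.format(cl / tl))
print('Branch coverage: {:.2%}'.format(cb / tb))
```

Each pass is an ordinary function on lists, so any one of them can be tested or swapped out without touching the other two.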
Here are some quickly-hacked, untested ideas applied to your code:
import numpy as np
import re
datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for pkg in core_pkgs:
    ptn = re.compile('.*'+pkg+'.*'+r'>(\d+)/(\d+).*>(\d+)/(\d+).*')
    matches = map(ptn.match, datafile)  # map() takes the function first
    statsList = [list(map(int, match.groups())) for match in matches if match]
    # statsList is a list of [cvln, tlln, cvbh, tlbh]
    stats = np.array(statsList)
    # sum down the rows (axis=0) to get one total per column
    covered_lines, total_lines, covered_branches, total_branches = stats.sum(axis=0)
Well, as you can see I haven't bothered to finish off the remaining loop, but I think the point is made by now. There's certainly a lot more than one way to do this; I elected to show off map() (which some will say makes this less efficient, and it probably does), as well as NumPy to get the (admittedly light) math done.
This is the corresponding Clojure solution:
(defn extract-data
"extract 4 integer from a string line according to a package name"
[pkg line]
(map read-string
(rest (first
(str pkg ".*>(\\d+)/(\\d+).*>(\\d+)/(\\d+)"))
(defn scan-lines-by-pkg
  "scan all string lines and extract all data as integer sequences
  according to package names"
  [pkgs lines]
  (filter seq (for [pkg pkgs
                    line lines]
                (extract-data pkg line))))
(defn sum-data
  "add all data in valid lines together"
  [pkgs lines]
  (apply map + (scan-lines-by-pkg pkgs lines)))
(defn get-percent
  [covered all]
  (str (format "%.2f" (float (/ (* covered 100) all))) "%"))
(defn get-cov
  [pkgs lines]
  {:line-cov (apply get-percent (take 2 (sum-data pkgs lines)))
   :branch-cov (apply get-percent (drop 2 (sum-data pkgs lines)))})
(get-cov ["d" "f"] ["abc" "d>11/23d>34/89d" "e>25/65e>13/25e" "f>36/92f>19/76"])

