Efficient way of finding potential substitutes in python string - python

Say I have a "ground truth" string such as ABA1234, and a "predicted" string that I want to compare to it, for example _ABA1234, and I have a list of acceptable substitutions, for example
{
    "A": ["_A", "A"],
    "1": ["I", "1"],
}
What is the most efficient way of deciding whether or not the predicted string is equal to the target "ground truth" string with the given acceptable substitutions?
The brute-force method - generating candidates from the ground truth string by applying the substitutions - is exponential, and not suitable for my purposes. Does anyone know a sub-exponential-runtime algorithm for this kind of problem? Can regex help here?

As I said in the comments, with the specific substitutions given by the OP as an example, there isn't much ambiguity. Consider a regex pattern based on the given truth and substitutions:
>>> make_pattern('ABA1234', {'A': ['_A', 'A'], '1': ['I', '1']})
re.compile(r'(?:_A|A)B(?:_A|A)(?:I|1)234', re.UNICODE)
When the regex encounters '_A' in the test string, at that point in the pattern the corresponding possibilities are (?:_A|A), which is immediately decidable (no backtracking). When an 'I' is encountered, the pattern at that point has (?:I|1), also decidable without backtracking. At the first non-match, the whole pattern is abandoned and the result is None.
As @KellyBundy shows, it is easy to construct simple examples that lead to much higher backtracking complexity. For example:
{
    'A': ['A', 'AB'],
    'B': ['BC', 'C'],
}
This answer assumes that the "exponential complexity" the OP mentions refers to the size of the tree of possible candidates, i.e. all the strings that could match the truth given the substitutions (in the OP's example, that size is 8, or 2**3, because there are 3 substitutable parts with two alternates each). But we don't have to explore that tree at all, as indicated above.
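For reference, that candidate tree is small enough to enumerate directly; here is a minimal sketch of the brute-force enumeration the OP describes (my own illustration, using itertools.product):
from itertools import product

truth = 'ABA1234'
subs = {'A': ['_A', 'A'], '1': ['I', '1']}
alternatives = [subs.get(c, [c]) for c in truth]  # one list of options per character
candidates = {''.join(p) for p in product(*alternatives)}
print(len(candidates))  # 8 == 2**3: three substitutable positions, two choices each
This enumeration grows exponentially with the number of substitutable positions, which is exactly why the regex approach above avoids building it at all.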
If the complexity comes from ambiguous substitutions instead (as per Kelly's examples), then this answer won't help.
With these caveats, as the question currently stands and per OP's confirmation, simply compiling a regex pattern gives excellent performance:
Edit: at OP's request, we now allow arbitrary prefix and suffix to be present in the test string.
import re

def make_pattern(truth, substitutions):
    pat = ''.join([
        f"(?:{'|'.join([re.escape(s) for s in substitutions.get(c)])})"
        if c in substitutions else re.escape(c)
        for c in truth
    ])
    return re.compile(f'{pat}')
Then:
substitutions = {'A': ['_A', 'A'], '1': ['I', '1']}
truth = 'ABA1234'
pat = make_pattern(truth, substitutions)
test = 'xxxx_ABA1234xxxxx'
>>> bool(pat.search(test))
True
%timeit bool(pat.search(test))
330 ns ± 0.0872 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Let's generate arbitrarily longer random input:
import random

def gen(n, noise=.2):
    base = list(substitutions.items()) + [
        s for v in substitutions.values() for s in v
    ] + list('BCD02345_')
    parts = random.choices(base, k=n)  # may contain single elems and tuples
    truth = ''.join([s[0] if isinstance(s, tuple) else s for s in parts])
    test = ''.join([
        random.choice(s[1]) if isinstance(s, tuple) else s for s in parts
    ])
    # add noise prefix and suffix
    a = [s[0] if isinstance(s, tuple) else s for s in base]
    prefix = ''.join(random.choices(a, k=round(n * noise)))
    suffix = ''.join(random.choices(a, k=round(n * noise)))
    test = f'{prefix}{test}{suffix}'
    return truth, test
Examples:
random.seed(0) # for reproducibility
n = 50
truth, test = gen(n)
>>> truth
'43BACB3ICD5CI30A5_45I252C1B05_C4A4DA2142AC5AI5_ADA_'
>>> test
'_5_CII1I5A43BACB3ICD5CI30A5_45I252C1B05_C4A4D_A2142_AC5AI5_ADA_3DI13IA_A4A'
pat = make_pattern(truth, substitutions)
>>> bool(pat.search(test))
True
%timeit bool(pat.search(test))
527 ns ± 0.174 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
With n = 10_000 (so both truth and test will be no smaller than that):
random.seed(0) # for reproducibility
n = 10_000
truth, test = gen(n)
>>> len(truth), len(test)
(10645, 15238)
pat = make_pattern(truth, substitutions)
>>> bool(pat.search(test))
True
%timeit bool(pat.search(test))
129 µs ± 21.6 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
(Meanwhile, building the pattern itself is very predictable, since no matching or backtracking is involved, and takes ~4 ms for a 10K-char string.)


How to truncate a string in python including a truncated word in the result?

There is an excellent discussion here, Truncate a string without ending in the middle of a word, on how to do a 'smart' string truncation in Python.
But the problem with the solutions proposed there is that if the width limit falls within a word, then that word is thrown away completely.
How can I truncate a string in python setting a 'soft' width limit, i.e. if the limit falls in the middle of the word, then this word is kept?
Example:
str = "it's always sunny in philadelphia"
trunc(str, 7)
>>> it's always...
My initial thinking is to slice the string up to the soft limit and then start checking every next character, adding it to the slice until I encounter a whitespace character. But this seems extremely inefficient.
How about:
def trunc(ipt, length, suffix='...'):
    if " " in ipt[length-1: length]:
        # The given length puts us on a word boundary
        return ipt[:length].rstrip(' ') + suffix
    # Otherwise add the "tail" of the input, up to just before the first space it contains
    return ipt[:length] + ipt[length:].partition(" ")[0] + suffix

s = "it's always sunny in philadelphia"  # Best to avoid 'str' as a variable name, it's a builtin
for n in (1, 4, 5, 6, 7, 12, 13):
    print(f"{n}: {trunc(s, n)}")
which outputs:
1: it's...
4: it's...
5: it's...
6: it's always...
7: it's always...
12: it's always...
13: it's always sunny...
Note the behaviour of the 5 and 12 cases: this code assumes that you want to eliminate the space that would appear before the "...".
Somehow I missed the answer provided in the linked post by Markus Jarderot:
import re

def smart_truncate2(text, min_length=100, suffix='...'):
    """If the `text` is more than `min_length` characters long,
    it will be cut at the next word-boundary and `suffix` will
    be appended.
    """
    pattern = r'^(.{%d,}?\S)\s.*' % (min_length-1)
    return re.sub(pattern, r'\1' + suffix, text)
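For example, applying it to the OP's string with min_length=7 (my own check, mirroring the trunc(str, 7) call above) gives:
smart_truncate2("it's always sunny in philadelphia", min_length=7)
# "it's always..."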
It runs in
3.49 µs ± 25.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
@slothrop's solution runs in:
897 ns ± 3.27 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
which is considerably faster.

Is there any faster way in python to split strings to sublists in a list with 1million elements?

I'm trying to help my friend clean an order-list dataframe with one million elements.
You can see that the product_name column should be a list, but the values are stored as strings. So I want to split them into sublists.
Here's my code:
order_ls = raw_df['product_name'].tolist()
cln_order_ls = list()
for i in order_ls:
    i = i.replace('[', '')
    i = i.replace(']', '')
    i = i.replace('\'', '')
    cln_order_ls.append(i)

new_cln_order_ls = list()
for i in cln_order_ls:
    new_cln_order_ls.append(i.split(', '))
But the 'split' part took a lot of time to process. I'm wondering if there is a faster way to deal with it?
Thanks~
EDIT
(I did not like my last answer, it was too confusing, so I reordered it and tested a little bit more systematically.)
Long story short:
For speed, just use:
def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')

df['product_name'].apply(str_to_list).to_list()
Long story long:
Let's dissect your code:
order_ls = raw_df['product_name'].tolist()
cln_order_ls = list()
for i in order_ls:
    i = i.replace('[', '')
    i = i.replace(']', '')
    i = i.replace('\'', '')
    cln_order_ls.append(i)

new_cln_order_ls = list()
for i in cln_order_ls:
    new_cln_order_ls.append(i.split(', '))
What you would really like to have is a function, say str_to_list(), which converts your input string to a list.
For some reason you do it in multiple steps, but this is really not necessary. What you have so far can be rewritten as:
def str_to_list_OP(s):
    return s.replace('[', '').replace(']', '').replace('\'', '').split(', ')
If you can assume that [ and ] are always the first and last char of your string, you can simplify this to:
def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')
which should also be faster.
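As a quick sanity check of my own (assuming the strings really are wrapped in [ and ] as in the question):
str_to_list("['C1', 'C2', 'None']")
# ['C1', 'C2', 'None']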
Alternative approaches would use regular expressions, e.g.:
import re

def str_to_list_regex(s):
    regex = re.compile(r'[\[\]\']')
    return re.sub(regex, '', s).split(', ')
Note that all approaches so far use split(). This is quite a fast implementation that approaches C speed, and hardly any pure-Python construct would beat it.
All these methods are quite unsafe as they do not take escaping properly into account, e.g. all of the above would fail for the following valid Python code:
['ciao', "pippo", 'foo, bar']
More robust alternatives in this scenario would be (see the comparison below):
ast.literal_eval, which works for any valid Python literal;
json.loads, which actually requires valid JSON strings, so it is not really an option here.
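To see the safety difference concretely, here is a small check of my own on the problematic string above (slicing/splitting on one side, ast.literal_eval on the other):
import ast

s = """['ciao', "pippo", 'foo, bar']"""
ast.literal_eval(s)                    # ['ciao', 'pippo', 'foo, bar'] -- parsed correctly
s[1:-1].replace('\'', '').split(', ')  # ['ciao', '"pippo"', 'foo', 'bar'] -- quotes kept, 'foo, bar' split in two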
The speed of these solutions was compared in a benchmark; as you can see from the plots, safety comes at the price of speed.
(These graphs were generated using these scripts with the following:
def gen_input(n):
    return str([str(x) for x in range(n)])

def equal_output(a, b):
    return a == b

input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000)
funcs = str_to_list_OP, str_to_list, str_to_list_regex, ast.literal_eval

runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)
Now let's concentrate on the looping. What you do is explicit looping, and we know that Python is typically not terribly fast at that.
However, looping inside a comprehension can be faster because it can generate more optimized code.
Another approach would be a vectorized expression using Pandas primitives, either with apply() or with .str chaining.
The following timings were obtained, indicating that comprehensions are the fastest for smaller inputs, although the vectorized solution (using apply) catches up and eventually surpasses the comprehension:
The following test functions were used:
import pandas as pd

def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')

def func_OP(df):
    order_ls = df['product_name'].tolist()
    cln_order_ls = list()
    for i in order_ls:
        i = i.replace('[', '')
        i = i.replace(']', '')
        i = i.replace('\'', '')
        cln_order_ls.append(i)
    new_cln_order_ls = list()
    for i in cln_order_ls:
        new_cln_order_ls.append(i.split(', '))
    return new_cln_order_ls

def func_QuangHoang(df):
    return df['product_name'].str[1:-1].str.replace('\'','').str.split(', ').to_list()

def func_apply_df(df):
    return df['product_name'].apply(str_to_list).to_list()

def func_compr(df):
    return [str_to_list(s) for s in df['product_name']]
with the following test code:
def gen_input(n):
    return pd.DataFrame(
        columns=('order_id', 'product_name'),
        data=[[i, "['ciao', 'pippo', 'foo', 'bar', 'baz']"] for i in range(n)])

def equal_output(a, b):
    return a == b

input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000)
funcs = func_OP, func_QuangHoang, func_apply_df, func_compr

runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)
again using the same base scripts as before.
How about:
(df['product_name']
.str[1:-1]
.str.replace('\'','')
.str.split(', ')
)
Try this
import ast
raw_df['product_name'] = raw_df['product_name'].apply(lambda x : ast.literal_eval(x))
I am curious about the list comp approach, like anky_91, so I gave it a try. I do the list comp directly on the ndarray to save time on calling tolist:
n = raw_df['product_name'].values
[x[1:-1].replace('\'', '').split(', ') for x in n]
Sample data:
In [1488]: raw_df.values
Out[1488]:
array([["['C1', 'None', 'None']"],
["['C1', 'C2', 'None']"],
["['C1', 'C1', 'None']"],
["['C1', 'C2', 'C3']"]], dtype=object)
In [1491]: %%timeit
...: n = raw_df['product_name'].values
...: [x[1:-1].replace('\'', '').split(', ') for x in n]
...:
16.2 µs ± 614 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [1494]: %timeit my_func_2b(raw_df)
36.1 µs ± 489 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [1493]: %timeit my_func_2(raw_df)
39.1 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [1492]: %timeit raw_df['product_name'].str[1:-1].str.replace('\'','').str.split(', ').tolist()
765 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, listcomp with chained replace and split is the fastest, about twice as fast as the next one. However, the time saved actually comes from using the ndarray directly without calling tolist; if I add tolist, the difference is not significant.

Combining a string and integer and printing

Given the lists below:
a = ['abc','cde','efg']
b = [[1,2,3],[2,3,4],[4,5,6]]
What is an optimized way to print the output as shown below?
I'm looking for an optimized way because in reality I have about 100 x 100 elements.
Also keep in mind that each element in b is an integer, while each element in a is a string.
abc,1,2,3
cde,2,3,4
efg,4,5,6
To print in the exact format you specified:
print('\n'.join([a[i] + ',' + str(b[i]).strip('[').strip(']').replace(' ','') for i in range(len(a))]))
Output:
abc,1,2,3
cde,2,3,4
efg,4,5,6
100*100 elements is a very small number for a Python program - any optimization at this scale will probably not be significant enough for a human to notice. To test:
%%timeit
array = np.random.randn(100,100)
print('\n'.join([str(e) for e in array])) # prints like above
result:
148 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Also, keep in mind that the main bottleneck should be the print itself, not the process that builds the output, hence using zip or other tricks may not help much, as they don't make the terminal (or whatever captures stdout) print any faster.
Use range:
for i in range(len(b)):
    print("{},{}".format(a[i], ','.join([str(x) for x in b[i]])))
# output:
abc,1,2,3
cde,2,3,4
efg,4,5,6
You can try using zip() and str.join():
>>> a = ['abc','cde','efg']
>>> b = [[1,2,3],[2,3,4],[4,5,6]]
>>> print('\n'.join(','.join(map(str, (x, *y))) for x, y in zip(a, b)))
abc,1,2,3
cde,2,3,4
efg,4,5,6

Time complexity calculation for my algorithm

Given a string, find the first non-repeating character in it and return its index. If it doesn't exist, return -1. You may assume the string contains only lowercase letters.
I'm going to define a hash that tracks the occurrence of characters. Traverse the string from left to right and check if the current character is in the hash; if it is, continue. Otherwise, in another loop, traverse the rest of the string to see if the current character occurs again. Return the index if it does not, and update the hash if it does.
def firstUniqChar(s):
    track = {}
    for index, i in enumerate(s):
        if i in track:
            continue
        elif i in s[index+1:]:  # for the last element, s[index+1:] is '' and `i in ''` is False
            track[i] = 1
            continue
        else:
            return index
    return -1
firstUniqChar('timecomplexity')
What's the time complexity (average and worst) of my algorithm?
Your algorithm has time complexity of O(kn), where k is the number of unique characters in the string. If k is a constant then it is O(n). Since the problem description clearly bounds the number of possible characters ("assume the string contains only lowercase letters"), k is constant and your algorithm runs in O(n) time on this problem. Even as n grows to infinity, you will only make O(1) slices of the string and your algorithm will remain O(n). If you removed track, then it would be O(n²):
In [36]: s = 'abcdefghijklmnopqrstuvwxyz' * 10000
In [37]: %timeit firstUniqChar(s)
100 loops, best of 3: 18.2 ms per loop
In [38]: s = 'abcdefghijklmnopqrstuvwxyz' * 20000
In [37]: %timeit firstUniqChar(s)
10 loops, best of 3: 36.3 ms per loop
In [38]: s = 'timecomplexity' * 40000 + 'a'
In [39]: %timeit firstUniqChar(s)
10 loops, best of 3: 73.3 ms per loop
It pretty much holds that T(n) is still O(n) - it scales linearly with the number of characters in the string, even though this is the worst-case scenario for your algorithm, where there is no single character that is unique.
I will present a not-that-efficient but simple and smart method here: count the character histogram first with collections.Counter, then iterate over the characters to find the first one that occurs only once:
from collections import Counter

def first_uniq_char_ultra_smart(s):
    counts = Counter(s)
    for i, c in enumerate(s):
        if counts[c] == 1:
            return i
    return -1

first_uniq_char_ultra_smart('timecomplexity')
This has time complexity of O(n); Counter builds the histogram in O(n) time and we need to enumerate the string again for O(n) characters. However, in practice I believe my algorithm has low constant factors, because it uses a standard dictionary via Counter.
And let's make a very stupid brute-force algorithm. Since you can assume that the string contains only lowercase letters, use that assumption:
import string

def first_uniq_char_very_stupid(s):
    indexes = []
    for c in string.ascii_lowercase:
        if s.count(c) == 1:
            indexes.append(s.find(c))
    # default=-1 is Python 3 only
    return min(indexes, default=-1)
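A quick check of my own on the OP's example string:
first_uniq_char_very_stupid('timecomplexity')  # 4 -- 'c' is the first character occurring exactly once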
Let's test my algorithm and some algorithms found in the other answers, on Python 3.5. I've chosen a case that is pathologically bad for my algorithm:
In [30]: s = 'timecomplexity' * 10000 + 'a'
In [31]: %timeit first_uniq_char_ultra_smart(s)
10 loops, best of 3: 35 ms per loop
In [32]: %timeit karin(s)
100 loops, best of 3: 11.7 ms per loop
In [33]: %timeit john(s)
100 loops, best of 3: 9.92 ms per loop
In [34]: %timeit nicholas(s)
100 loops, best of 3: 10.4 ms per loop
In [35]: %timeit first_uniq_char_very_stupid(s)
1000 loops, best of 3: 1.55 ms per loop
So, my stupid algorithm is the fastest, because it finds the a at the end and bails out. And my smart algorithm is the slowest. One more reason for the bad performance of my algorithm, besides this being its worst case, is that OrderedDict is written in C on Python 3.5, while Counter is in Python.
Let's make a better test here:
In [60]: s = string.ascii_lowercase * 10000
In [61]: %timeit nicholas(s)
100 loops, best of 3: 18.3 ms per loop
In [62]: %timeit karin(s)
100 loops, best of 3: 19.6 ms per loop
In [63]: %timeit john(s)
100 loops, best of 3: 18.2 ms per loop
In [64]: %timeit first_uniq_char_very_stupid(s)
100 loops, best of 3: 2.89 ms per loop
So it appears that the "stupid" algorithm of mine isn't that stupid at all; it exploits the speed of C while minimizing the number of iterations of Python code being run, and wins clearly in this problem.
As others have noted, your algorithm looks O(n²) due to the nested linear search. However, as discovered by @Antti, the OP's algorithm is linear and bounded by O(kn), with k the number of all possible lowercase letters.
My proposal for an O(n) solution:
from collections import OrderedDict

def first_unique_char(string):
    duplicated = OrderedDict()  # ordered dict of char to boolean indicating duplicate existence
    for s in string:
        duplicated[s] = s in duplicated
    for char, is_duplicate in duplicated.items():
        if not is_duplicate:
            return string.find(char)
    return -1

print(first_unique_char('timecomplexity'))  # 4
Your algorithm is O(n²), because you have a "hidden" iteration over a slice of s inside the loop over s.
A faster algorithm would be:
def first_unique_character(s):
    good = {}    # char: idx
    bad = set()  # char
    for index, ch in enumerate(s):
        if ch in bad:
            continue
        if ch in good:  # new repeat
            bad.add(ch)
            del good[ch]
        else:
            good[ch] = index
    if not good:
        return -1
    return min(good.values())
This is O(n) because the in lookups use hash tables, and the number of distinct characters should be much less than len(s).
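A quick sanity check of my own on the OP's example:
print(first_unique_character('timecomplexity'))  # 4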

Normalize strings that represent (combinatorical) necklaces [closed]

I'm trying to match "necklaces" of symbols in Python by looking up their linear representations, for which I use normal strings. For example, the strings "AABC", "ABCA", "BCAA", "CAAB" all represent the same necklace (pictured).
In order to get an overview, I store only one of the equivalent strings of a given necklace as a "representative". To check whether I've already stored a candidate necklace, I need a function that normalizes any given string representation. As a kind of pseudo code, I wrote a function in Python:
import collections
def normalized(s):
q = collections.deque(s)
l = list()
l.append(''.join(q))
for i in range(len(s)-1):
q.rotate(1)
l.append(''.join(q))
l.sort()
return l[0]
For all the string representations in the above example necklace, this function returns "AABC", which comes first alphabetically.
Since I'm relatively new to Python, I wonder - if I'd start implementing an application in Python - would this function be already "good enough" for production code? In other words: Would an experienced Python programmer use this function, or are there obvious flaws?
If I understand you correctly, you first need to construct all circular permutations (rotations) of the input sequence and then determine the lexicographically smallest one. That is the canonical starting point of your symbol loop.
Try this:
def normalized(s):
    L = [s[i:] + s[:i] for i in range(len(s))]
    return sorted(L)[0]
This code works with strings only; there are no conversions between deque and string as in your code. A quick timing test shows it runs in 30-50% of the time.
It would be interesting to know the length of s in your application. As all rotations have to be stored temporarily, about len(s)^2 bytes are needed for the temporary list L. Hopefully this is not a constraint in your case.
Edit:
Today I stumbled upon the observation that if you concatenate the original string to itself it will contain all rotations as substrings. So the code will be:
def normalized4(s):
    ss = s + s  # contains all rotations of s as substrings
    n = len(s)
    return min((ss[i:i+n] for i in range(n)))
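A quick check of my own against the necklace from the question:
print(normalized4('BCAA'))  # 'AABC', the same representative the original normalized() returns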
This will indeed run faster, as there is only one concatenation left plus n slicings. Using string lengths of 10 to 10**5, the runtime is between 55% and 66% on my machine, compared to the min() version with a generator.
Please note that you trade off speed for memory consumption (2x), which doesn't matter here but might in a different setting.
You could use min rather than sorting:
def normalized2(s):
    return min((s[i:] + s[:i] for i in range(len(s))))
But it still needs to copy the string len(s) times. A faster way is to filter the starting indexes of the smallest character until only one remains, effectively searching for the smallest rotation:
def normalized3(s):
    ssize = len(s)
    minchar = min(s)
    minindexes = [i for i in range(ssize) if minchar == s[i]]
    for offset in range(1, ssize):
        if len(minindexes) == 1:
            break
        minchar = min(s[(i+offset) % ssize] for i in minindexes)
        minindexes = [i for i in minindexes if minchar == s[(i+offset) % ssize]]
    return s[minindexes[0]:] + s[:minindexes[0]]
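And a quick check of my own that it agrees with the other versions on the question's example:
print(normalized3('BCAA'))  # 'AABC'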
For long strings this is much faster:
In [143]: loop = [ random.choice("abcd") for i in range(100) ]
In [144]: timeit normalized(loop)
1000 loops, best of 3: 237 µs per loop
In [145]: timeit normalized2(loop)
10000 loops, best of 3: 91.3 µs per loop
In [146]: timeit normalized3(loop)
100000 loops, best of 3: 16.9 µs per loop
But if there is a lot of repetition, this method is not efficient:
In [147]: loop = "abcd" * 25
In [148]: timeit normalized(loop)
1000 loops, best of 3: 245 µs per loop
In [149]: timeit normalized2(loop)
100000 loops, best of 3: 18.8 µs per loop
In [150]: timeit normalized3(loop)
1000 loops, best of 3: 612 µs per loop
We could also scan the string forward, but I doubt it could be any faster without some fancy algorithm.
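For completeness, such a "fancy algorithm" does exist: Booth's algorithm finds the lexicographically least rotation in O(n) without materializing any rotated copies. The sketch below follows the commonly published formulation and is my own addition, not part of any answer here, so treat it as illustrative rather than battle-tested:
def least_rotation(s):
    # Booth's algorithm: index where the lexicographically smallest rotation starts, O(n)
    ss = s + s
    f = [-1] * len(ss)  # failure function over the doubled string
    k = 0               # current candidate start of the least rotation
    for j in range(1, len(ss)):
        sj = ss[j]
        i = f[j - k - 1]
        while i != -1 and sj != ss[k + i + 1]:
            if sj < ss[k + i + 1]:
                k = j - i - 1
            i = f[i]
        if sj != ss[k + i + 1]:
            if sj < ss[k]:  # i == -1 here
                k = j
            f[j - k] = -1
        else:
            f[j - k] = i + 1
    return k

def normalized_booth(s):
    k = least_rotation(s)
    return s[k:] + s[:k]

print(normalized_booth('BCAA'))  # 'AABC'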
How about something like this:
patterns = ['abc', 'bca', 'cab']
normalized = lambda p: ''.join(sorted(p))
normalized_patterns = set(normalized(p) for p in patterns)
Example output:
In [1]: normalized = lambda p: ''.join(sorted(p))
In [2]: normalized('abba')
Out[2]: 'aabb'
In [3]: normalized('CBAE')
Out[3]: 'ABCE'
