Efficient group substring search in Python?

Let's say I've loaded some information from a file into a Python 3 dict and the result looks like this.
d = {
    'hello': ['hello', 'hi', 'greetings'],
    'goodbye': ['bye', 'goodbye', 'adios'],
    'lolwut': ['++$(#$(#%$(##*', 'ASDF #!## TOW']
}
Let's say I'm going to analyze a bunch, I mean an absolute ton, of strings. If a string contains any of the values for a given key of d, then I want to categorize it as being in that key.
For example...
'My name is DDP, greetings' => 'hello'
Obviously I can loop through the keys and values like this...
def classify(s, d):
    for k, v in d.items():
        if any([x in s for x in v]):
            return k
    return ''
But I want to know if there's a more efficient algorithm for this kind of bulk searching; more efficient than my naive loop. Is anyone aware of such an algorithm?

You can use a regex to avoid extra operations. All you need here is to join the words with a pipe character and pass the pattern to re.search(). Since neither the order nor the exact word matters to you, this way you can find out whether there is any intersection between a key's values and the given string.
import re
def classify(s, d):
    for k, v in d.items():
        regex = re.compile('|'.join(map(re.escape, v)))
        if regex.search(s):
            return k
Also note that, instead of returning k, you can yield it to get an iterator of all matching keys, or use a dictionary to store them, etc.
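Since the same dictionary is reused for an absolute ton of strings, one further tweak (a sketch, not part of the answer above; build_patterns is a hypothetical helper) is to compile one pattern per key once up front and reuse the compiled patterns for every lookup:
import re

def build_patterns(d):
    # Compile one alternation per key, escaping each word so it is matched literally.
    return {k: re.compile('|'.join(map(re.escape, v))) for k, v in d.items()}

def classify(s, patterns):
    for k, regex in patterns.items():
        if regex.search(s):
            return k
    return ''

patterns = build_patterns(d)  # d is the dict from the question
print(classify('My name is DDP, greetings', patterns))  # hello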

Simplifying the code to a dictionary comprehension

In a directory images, the images are named like 1_foo.png, 2_foo.png, 14_foo.png, etc.
The images are OCR'd and the extracted text is stored in a dict by the code below:
data_dict = {}
for i in os.listdir(images):
    if str(i[1]) != '_':
        k = str(i[:2])  # Get first two characters of image name and use as 'key'
    else:
        k = str(i[:1])  # Get first character of image name and use as 'key'
    # Initiates a list for each key and allows storing multiple entries
    data_dict.setdefault(k, [])
    data_dict[k].append(pytesseract.image_to_string(i))
The code performs as expected.
The images can have varying numbers in their name ranging from 1 to 99.
Can this be reduced to a dictionary comprehension?
No. Each iteration in a dict comprehension assigns a value to a key; it cannot update an existing value list. Dict comprehensions aren't always better; the code you wrote seems good enough. Although maybe you could write:
data_dict = {}
for i in os.listdir(images):
    k = i.partition("_")[0]
    image_string = pytesseract.image_to_string(i)
    data_dict.setdefault(k, []).append(image_string)
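For what it's worth, an equivalent shape (not from the answer above, just a common alternative) uses collections.defaultdict so the empty-list default lives in one place:
import os
from collections import defaultdict
import pytesseract  # assumed to be installed and configured, as in the question

data_dict = defaultdict(list)
for i in os.listdir(images):
    k = i.partition("_")[0]  # numeric prefix before the first underscore
    data_dict[k].append(pytesseract.image_to_string(i))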
Yes. Here's one way, but I wouldn't recommend it:
{k: d.setdefault(k, []).append(pytesseract.image_to_string(i)) or d[k]
 for d in [{}]
 for k, i in ((i.split('_')[0], i) for i in names)}
That might be as clean as I can make it, and it's still bad. Better to use a normal loop, especially a clean one like Dennis's.
Slight variation (if I do the abuse once, I might as well do it twice):
{k: d.setdefault(k, []).append(pytesseract.image_to_string(i)) or d[k]
 for d in [{}]
 for i in names
 for k in i.split('_')[:1]}
Edit: kaya3 now posted a good one using a dict comprehension. I'd recommend that over mine as well. Mine are really just the dirty results of me being like "Someone said it can't be done? Challenge accepted!".
In this case itertools.groupby can be useful; you can group the filenames by the numeric part. But making it work is not easy, because the groups have to be contiguous in the sequence.
That means before we can use groupby, we need to sort using a key function which extracts the numeric part. That's the same key function we want to group by, so it makes sense to write the key function separately.
from itertools import groupby

def image_key(image):
    return str(image).partition('_')[0]

images = ['1_foo.png', '2_foo.png', '3_bar.png', '1_baz.png']
result = {
    k: list(v)
    for k, v in groupby(sorted(images, key=image_key), key=image_key)
}
# {'1': ['1_foo.png', '1_baz.png'],
# '2': ['2_foo.png'],
# '3': ['3_bar.png']}
Replace list(v) with list(map(pytesseract.image_to_string, v)) for your use-case.
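Put together for the question's setup, that might look roughly like this (a sketch: it assumes images is the directory path from the question and passes full paths to image_to_string):
import os
from itertools import groupby
import pytesseract  # assumed to be installed and configured, as in the question

def image_key(filename):
    return filename.partition('_')[0]

# Sort by the numeric prefix so that groupby sees contiguous groups.
filenames = sorted(os.listdir(images), key=image_key)
data_dict = {
    k: [pytesseract.image_to_string(os.path.join(images, f)) for f in v]
    for k, v in groupby(filenames, key=image_key)
}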

How to form a string by replacing dict key with its values

I have 2 items in a dict, both of which have a list of strings as values. I need to form and print strings by replacing all the substrings matching a _key_ of the dict with its values (all possible combinations).
e.g:
my_dict['season']=['spring','summer','fall']
my_dict['sport']=['baseball','soccer']
name='bobs-_season_-_sport_'
From the above, I need to produce the output below by replacing the patterns _season_ and _sport_ in name with all possible value combinations.
bobs-spring-baseball
bobs-spring-soccer
bobs-summer-baseball
bobs-summer-soccer
bobs-fall-baseball
bobs-fall-soccer
The code below works, but there should be a better way; I also need to make this work when I have another (third) item in my_dict. Thanks.
>>> my_dict=dict()
>>> my_dict['season']=['spring','summer','fall']
>>> my_dict['sport']=['baseball','soccer']
>>> name='bobs-_season_-_sport_'
>>>
>>> keys_list = list(my_dict.keys())
>>> if len(keys_list) == 2:
...     first_key = keys_list[0]
...     sec_key = keys_list[1]
...     for first_value in my_dict[first_key]:
...         name1 = name
...         if f'_{first_key}_' in name1:
...             name1 = name1.replace(f'_{first_key}_', first_value)
...         for sec_value in my_dict[sec_key]:
...             name2 = name1.replace(f'_{sec_key}_', sec_value)
...             print(name2)
...
bobs-spring-baseball
bobs-spring-soccer
bobs-summer-baseball
bobs-summer-soccer
bobs-fall-baseball
bobs-fall-soccer
I suggest that you split the problem up into two parts. The first step parses your input string (name) to figure out which keys it needs for replacements later. In the second step, you calculate all the combinations of values and do the substitutions.
I suggest using regular expressions for the parsing:
import re
keys = re.findall(r'_([^_]+)_', name)
Now make the combinations using itertools.product:
import itertools

for values in itertools.product(*[my_dict[key] for key in keys]):
    output = name
    for key, value in zip(keys, values):
        output = output.replace("_{}_".format(key), value)
    print(output)
If you can choose the format you're using in name, a much nicer one (for your code) would be bob-{season}-{sport} because you can use it directly as a format string (rather than needing to repeatedly call str.replace). Using that style of input, you could replace the code above with:
keys = re.findall(r'\{([^}]+)\}', name)
for values in itertools.product(*[my_dict[key] for key in keys]):
    print(name.format(**dict(zip(keys, values))))
I see in a recent comment that your dictionary may not always have keys for all of the key strings in name. If that's the case, then you should use my_dict.get(key, ['_{}_'.format(key)]) instead of my_dict[key] in the list comprehension inside of the product call. This will replace _unknownkey_ with itself, since there's nothing else to use instead.
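In other words, with that fallback the loop from earlier would look roughly like this (only the list comprehension inside product changes):
for values in itertools.product(*[my_dict.get(key, ['_{}_'.format(key)]) for key in keys]):
    output = name
    for key, value in zip(keys, values):
        # Unknown keys fall back to '_key_' itself, so the replace is a no-op for them.
        output = output.replace("_{}_".format(key), value)
    print(output)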
You can use itertools.product to generate the different cases:
import itertools

my_dict = {}
my_dict['season'] = ['spring', 'summer', 'fall']
my_dict['sport'] = ['baseball', 'soccer']
name = 'bobs-_season_-_sport_'

for values in itertools.product(*my_dict.values()):
    new_name = name
    for key, value in zip(my_dict.keys(), values):
        new_name = new_name.replace(f'_{key}_', value)
    print(new_name)
Note that you need Python 3.6+ to ensure that dictionary insertion order is kept.
You can use re for finding the keys in the string, then create a suitable template string and use itertools.product for creating the value tuples:
from functools import reduce
import itertools as it
import re
keys = re.findall(r'_([a-z]+)_', name)
template = reduce(lambda s, k: s.replace(f'_{k}_', '{}'), keys, name)
result = list(template.format(*x) for x in it.product(*[my_dict[k] for k in keys]))
Or you can use it.starmap in place of the last line, if you find it clearer:
result = list(it.starmap(template.format, it.product(*[my_dict[k] for k in keys])))
Yes, you can do this using list comprehensions
https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions
my_dict=dict()
my_dict['season']=['spring','summer','fall']
my_dict['sport']=['baseball','soccer']
name='bobs-_season_-_sport_'
def generate_all_replaced(name, d):
    names = [name]
    for k in d.keys():
        names = [n.replace(f'_{k}_', v) for v in d[k] for n in names]
    return names
print(generate_all_replaced(name, my_dict))

A better way to rewrite multiple appended replace methods using an input array of strings in python?

I have a really ugly command where I use many appended "replace()" methods to replace/substitute/scrub many different strings from an original string. For example:
newString = originalString.replace(' ', '').replace("\n", '').replace('()', '').replace('(Deployed)', '').replace('(BeingAssembled)', '').replace('ilo_', '').replace('ip_', '').replace('_ilop', '').replace('_ip', '').replace('backupnetwork', '').replace('_ilo', '').replace('prod-', '').replace('ilo-','').replace('(EndofLife)', '').replace('lctcvp0033-dup,', '').replace('newx-', '').replace('-ilo', '').replace('-prod', '').replace('na,', '')
As you can see, it's a very ugly statement and makes it very difficult to know what strings are in the long command. It also makes it hard to reuse.
What I'd like to do is define an input array of many replacement pairs, where a replacement pair looks like [<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>], and the overall array looks something like:
replacementArray = [
    [<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>],
    [<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>],
    [<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>],
    [<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>]
]
AND, I'd like to pass that replacementArray, along with the original string that needs to be scrubbed to a function that has a structure something like:
def replaceAllSubStrings(originalString, replacementArray):
    newString = ''
    for each pair in replacementArray:
        perform the substitution
    return newString
MY QUESTION IS: What is the right way to write the function's code block to apply each pair in the replacementArray? Should I be using the "replace()" method? The "sub()" method? I'm confused as to how to restructure the original code into a nice clean function.
Thanks, in advance, for any help you can offer.
You have the right idea. Use sequence unpacking to iterate each pair of values:
def replaceAllSubStrings(originalString, replacementArray):
    for in_rep, out_rep in replacementArray:
        originalString = originalString.replace(in_rep, out_rep)
    return originalString
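For example, called with a few of the pairs from the question (the sample input string here is made up purely for illustration):
replacementArray = [
    [' ', ''],
    ['\n', ''],
    ['(Deployed)', ''],
    ['prod-', ''],
]

originalString = 'prod-server01 (Deployed)\n'  # made-up example input
print(replaceAllSubStrings(originalString, replacementArray))  # server01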
How about using re?
import re
def make_xlat(*args, **kwds):
    adict = dict(*args, **kwds)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    def xlat(text):
        return rx.sub(one_xlat, text)
    return xlat

replaces = {
    "a": "b",
    "well": "hello"
}
replacer = make_xlat(replaces)
replacer("a well?")
# b hello?
You can add as many items in replaces as you want.

string replace in Python 2.7

Using Python 2.7 and working on the string replace problem below; wondering if anyone has better ideas in terms of algorithmic space and time complexity?
I create an additional list to represent the result, since strings in Python 2.7 are immutable, and I also created an additional dictionary to speed up lookups in the character replacement table.
In the example, From: "lod" and To: "xpf" means: when l is encountered, replace it with x; when o is encountered, replace it with p; and when d is encountered, replace it with f.
'''
Given "data", "from", and "to" fields, replaces all occurrences of the characters in the "from" field in the "data" field, with their counterparts in the "to" field.
Example:
Input:
Data: "Hello World"
From: "lod"
To: "xpf"
Output:
"Hexxp Wprxf"
'''
from collections import defaultdict

def map_strings(from_field, to_field, data):
    char_map = defaultdict(str)
    result = []
    for i, v in enumerate(from_field):
        char_map[v] = to_field[i]
    for v in data:
        if v not in char_map:
            result.append(v)
        else:
            result.append(char_map[v])
    return ''.join(result)

if __name__ == "__main__":
    print map_strings('lod', 'xpf', 'Hello World')
There's efficient machinery in the standard modules for this. You first build a translation table using string.maketrans, then call the str.translate method:
import string
trans = string.maketrans('lod', 'xpf')
print "Hello World".translate(trans)
output
Hexxp Wprxf
But if you want to do it manually, here's a way that's a little more efficient than your current code:
def map_strings(from_field, to_field, data):
    char_map = dict(zip(from_field, to_field))
    return ''.join([char_map.get(c, c) for c in data])

s = map_strings('lod', 'xpf', 'Hello World')
print s
Note that in Python 3 the string.maketrans function no longer exists. There's now a str.maketrans method, with slightly different behaviour.
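For reference, the Python 3 version of the translation-table approach above would look something like this:
# Python 3: maketrans is a static method on str and takes the two strings directly.
trans = str.maketrans('lod', 'xpf')
print('Hello World'.translate(trans))  # Hexxp Wprxf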
You can also use replace:
def map_strings(from_field, to_field, data):
    for f, t in zip(from_field, to_field):
        data = data.replace(f, t)
    return data
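One caveat worth noting (not raised in the original answer): because the replacements are applied one after another, a character that has already been replaced can be replaced again if it also appears later in from_field. The translation-table approaches above map each original character in a single pass and avoid this. A small illustration:
print map_strings('ab', 'bc', 'ab')  # 'cc' - the new 'b' got replaced again

import string
print 'ab'.translate(string.maketrans('ab', 'bc'))  # 'bc' - single-pass mapping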

Sorting a list using a regex in Python

I have a list of email addresses with the following format:
name####email.com
But the number is not always present. For example: john45#email.com, bob#email.com, joe2#email.com, etc. I want to sort these names by the number, with those without a number coming first. I have come up with something that works, but being new to Python, I'm curious as to whether there's a better way of doing it. Here is my solution:
import re
def sortKey(name):
    m = re.search(r'(\d+)#', name)
    return int(m.expand(r'\1')) if m is not None else 0

names = [ ... a list of emails ... ]
for name in sorted(names, key=sortKey):
    print name
This is the only time in my script that I am ever using "sortKey", so I would prefer it to be a lambda function, but I'm not sure how to do that. I know this will work:
for name in sorted(names, key=lambda n: int(re.search(r'(\d+)#', n).expand(r'\1')) if re.search(r'(\d+)#', n) is not None else 0):
    print name
But I don't think I should need to call re.search twice to do this. What is the most elegant way of doing this in Python?
Better to use re.findall: if no numbers are found, it returns an empty list, which sorts before a populated list. The key used to sort is any numbers found (converted to ints), followed by the string itself...
emails = 'john45#email.com bob#email.com joe2#email.com'.split()
import re
print sorted(emails, key=lambda L: (map(int, re.findall('(\d+)#', L)), L))
# ['bob#email.com', 'joe2#email.com', 'john45#email.com']
And using john1 instead, the output is ['bob#email.com', 'john1#email.com', 'joe2#email.com'], which shows that although john sorts lexicographically after joe, the number is taken into account first, shifting john ahead.
There is a somewhat hackish way if you wanted to keep your existing method of using re.search in a one-liner (but yuck):
getattr(re.search('(\d+)#', s), 'groups', lambda: ('0',))()
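If the aim is just to avoid calling re.search twice inside the lambda, a sketch in the same spirit as the findall answer above would be:
# findall returns [] when there is no number; 'or ['0']' supplies the default key of 0.
for name in sorted(names, key=lambda n: int((re.findall(r'(\d+)#', n) or ['0'])[0])):
    print name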
