My code works, but I feel like the while loop is possibly not as succinct as it could be.
Maybe using a while loop for a set of two items or fewer is silly? I'm not sure.
# <SETUP CODE TO SIMULATE MY SITUATION>
import random
import re
# The real data set is much larger than this (Around 1,000 - 10,000 items):
names = {"abc", "def", "123"}
if random.randint(0, 3):
    # foo value is "foo" followed by a string of unknown digits:
    names.add("foo" + str(random.randint(0, 1000)))
if random.randint(0, 3):
    # bar value is just "bar":
    names.add("bar")
print("names:", names)
matches = {name for name in names if re.match("foo|bar", name)}
print("matches:", matches)
# In the names variable, foo and/or bar may be missing, thus len(matches) should be 0-2:
assert len(matches) <= 2, "Somehow got more than 2 matches"
# </SETUP CODE TO SIMULATE MY SITUATION>
foo, bar = None, None
while matches:
    match = matches.pop()
    if match == "bar":
        bar = match
    else:
        foo = match
print("foo:", foo)
print("bar:", bar)
And here's what else I've tried within the while loop.
I know ternaries don't work like this (at least not in Python), but this is the pipe-dream level of simplicity I was hoping for:
(bar if match == "bar" else foo) = match
The remove method doesn't return anything:
try:
    bar = matches.remove("bar")  # remove() returns None, so bar ends up None
except KeyError:
    foo = matches.pop()
The loop in your first code is fine; 10,000 inputs is really small at computer scale.
If you want to go slightly faster, you can iterate over matches without popping elements (popping takes extra time), simply replacing
while matches:
    match = matches.pop()
by
for match in matches:
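Put together, the only change is the loop header; nothing is popped, so the set survives the loop (a small sketch of the combined result):

foo, bar = None, None
for match in matches:  # no pop(): matches is left intact
    if match == "bar":
        bar = match
    else:
        foo = match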
Why don't you use a simple for loop instead of a while loop?
for match in matches:
    if match == 'bar':
        bar = match
    else:
        foo = match
print("foo:", foo)
print("bar:", bar)
You don't have to remove elements from the set one at a time, since your set only contains two or fewer elements. :P For larger sets you can delete the entire set after use with
del matches # will help in garbage collection.
In our case, this is not needed.
Related
Is there a way to search for a value in a dataframe column using FuzzyWuzzy or similar library?
I'm trying to find a value in one column that corresponds to the value in another, while taking fuzzy matching into account.
For example, if I have State Names in one column and State Codes in another, how would I find the state code for Florida, which is FL, while catering for abbreviations like "Flor"?
In other words, I want to find a match for a State Name corresponding to "Flor" and get the corresponding State Code "FL".
Any help is greatly appreciated.
If the abbreviations are all prefixes, you can use the .startswith() string method against either the short or long version of the state.
>>> test_value = "Flor"
>>> test_value.upper().startswith("FL")
True
>>> "Florida".lower().startswith(test_value.lower())
True
However, if you have more complex abbreviations, difflib.get_close_matches will probably do what you want!
>>> import pandas as pd
>>> import difflib
>>> df = pd.DataFrame({"states": ("Florida", "Texas"), "st": ("FL", "TX")})
>>> df
states st
0 Florida FL
1 Texas TX
>>> difflib.get_close_matches("Flor", df["states"].to_list())
['Florida']
>>> difflib.get_close_matches("x", df["states"].to_list(), cutoff=0.2)
['Texas']
>>> df["st"][df.index[df["states"]=="Texas"]].iloc[0]
'TX'
You will probably want to try/except IndexError around reading the first member of the list returned by difflib, and possibly tweak the cutoff to get fewer false matches between close state names (perhaps offer all the candidates to the user, or require more letters for close states).
You may also see the best results combining the two; testing prefixes first before trying the fuzzy match.
Putting it all together
def state_from_partial(test_text, df, col_fullnames, col_shortnames):
    if len(test_text) < 2:
        raise ValueError("must have at least 2 characters")
    # if there's exactly two characters, try to directly match short name
    # (.values, because "in" on a Series checks the index, not the values)
    if len(test_text) == 2 and test_text.upper() in df[col_shortnames].values:
        return test_text.upper()
    states = df[col_fullnames].to_list()
    match = None
    # this will definitely fail at least for states starting with M or New
    #for state in states:
    #    if state.lower().startswith(test_text.lower()):
    #        match = state
    #        break  # leave loop and prepare to find the prefix
    if not match:
        try:  # see if there's a fuzzy match
            match = difflib.get_close_matches(test_text, states)[0]  # cutoff=0.6
        except IndexError:
            pass  # consider matching against a list of problematic states with a different cutoff
    if match:
        return df[col_shortnames][df.index[df[col_fullnames]==match]].iloc[0]
    raise ValueError("couldn't find a state matching partial: {}".format(test_text))
Beware of states which start with 'New' or 'M' (and probably others), which are all pretty close and will probably want special handling. Testing will do wonders here.
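For example, a quick sanity check against the two-row frame from earlier (hypothetical usage, not from the original answer):

df = pd.DataFrame({"states": ("Florida", "Texas"), "st": ("FL", "TX")})
print(state_from_partial("Flor", df, "states", "st"))  # 'FL' via the fuzzy match
print(state_from_partial("TX", df, "states", "st"))    # 'TX' via the direct short-name hit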
I have very minimalistic code that performs autocompletion for queries typed by the user, by storing historical data of names (close to 1,000) in a list. Right now, it gives the lexicographically smallest suggestion.
The names stored in a list are (fictitious):
names = ["show me 7 wonders of the world","most beautiful places","top 10 places to visit","Population > 1000","Cost greater than 100"]
The queries given by the user can be:
queries = ["10", "greater", ">", "7 w"]
Current Implementation:
class Index(object):
    def __init__(self, words):
        index = {}
        for w in sorted(words, key=str.lower, reverse=True):
            lw = w.lower()
            for i in range(1, len(lw) + 1):
                index[lw[:i]] = w
        self.index = index

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """
        return self.index.get(prefix.lower(), 'no matches found')

def typeahead(usernames, queries):
    users = Index(usernames)
    print "\n".join(users.by_prefix(q) for q in queries)
This works fine if the query starts with one of the pre-stored names, but it fails to provide suggestions when the query matches somewhere in the middle of a string. It also does not recognize numbers, and fails for those too.
I was wondering if there could be a way to include the above functionalities to improve my existing implementation.
Any help is greatly appreciated.
It's O(n), but it works. Your function checks whether an entry starts with the prefix, but the behavior you describe is checking whether the entry contains the query:
class Index(object):
    def __init__(self, words):
        self.index = sorted(words, key=str.lower, reverse=True)

    def by_prefix(self, prefix):
        for item in self.index:
            if prefix in item:
                return item
        return 'no matches found'  # same fallback as the original
This gives:
top 10 places to visit
Cost greater than 100
Population > 1000
show me 7 wonders of the world
Just for the record, this takes 0.175 seconds on my PC for 5 queries over 1,000,005 records, with the last 5 records being the matching ones (worst-case scenario).
If you are not concerned about performance, you can test if prefix in item: for every item in your list names. This statement matches if prefix is part of the string item, e.g.:
prefix   item       match
'foo'    'foobar'   True
'bar'    'foobar'   True
'ob'     'foobar'   True
...
I think that this is the simplest way to achieve this, but clearly not the fastest.
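As a minimal sketch of that idea (the function name is mine, not from your code):

def first_containing(names, query):
    # return the lexicographically smallest name that contains the query
    for name in sorted(names, key=str.lower):
        if query.lower() in name.lower():
            return name
    return 'no matches found'

print(first_containing(names, "7 w"))  # show me 7 wonders of the world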
Another option is to add more entries to your index, e.g. for the item "most beautiful places":
"most beautiful places"
"beautiful places"
"places"
If you do this, you also get matches if you start typing a word that's not the first word in the sentence. You can modify your code like this to do that:
class Index(object):
    def __init__(self, words):
        index = {}
        for w in sorted(words, key=str.lower, reverse=True):
            lw = w.lower()
            tokens = lw.split(' ')
            for j in range(len(tokens)):
                w_part = ' '.join(tokens[j:])
                for i in range(1, len(w_part) + 1):
                    index[w_part[:i]] = w
        self.index = index
The downside of this approach is that the index gets very large. You could also combine it with the approach pointed out by Keatinge, by storing two-character prefixes of every word as the keys of the index dictionary, with lists of the entries containing that prefix as the values; a sketch follows.
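A rough sketch of that combination (two-character buckets; the class and method names are my own invention):

from collections import defaultdict

class BucketIndex(object):
    def __init__(self, words):
        self.buckets = defaultdict(list)
        for w in sorted(words, key=str.lower):
            lw = w.lower()
            # bucket every two-character window so mid-string queries hit too
            for i in range(len(lw) - 1):
                self.buckets[lw[i:i + 2]].append(w)

    def by_query(self, query):
        # queries shorter than two characters won't hit a bucket
        q = query.lower()
        # candidates share the first two characters; verify with a containment test
        for w in self.buckets.get(q[:2], []):
            if q in w.lower():
                return w
        return 'no matches found'

The buckets stay much smaller than the full prefix index, at the cost of a containment scan per candidate.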
I'm trying to create a repaired path using two dicts created with groupdict() from re.compile.
The idea is to swap out values from the wrong path with equally named values from the correct dict.
However, because the dicts are not in captured-group order, I can't rebuild the resulting string as a correct path; the values are not in the order the path requires.
I hope that makes sense; I've only been using Python for a couple of months, so I may be missing the obvious.
# for k, v in pat_list.iteritems():
#     pat = re.compile(v)
#     m = pat.match(Path)
#     if m:
#         mgd = m.groups(0)
#         pp(mgd)
This gives the correct value order, and groupdict() creates the right key/value pairs, but in the wrong order.
You could perhaps use something a bit like this:
pat = re.compile(r"(?P<FULL>(?P<to_ext>(?:(?P<path_file_type>(?P<path_episode>(?P<path_client>[A-Z]:[\\/](?P<client_name>[a-zA-z0-1]*))[\\/](?P<episode_format>[a-zA-z0-9]*))[\\/](?P<root_folder>[a-zA-Z0-9]*)[\\/])(?P<file_type>[a-zA-Z0-9]*)[\\/](?P<path_folder>[a-zA-Z0-9]*[_,\-]\d*[_-]?\d*)[\\/](?P<base_name>(?P<episode>[a-zA-Z0-9]*)(?P<scene_split>[_,\-])(?P<scene>\d*)(?P<shot_split>[_-])(?P<shot>\d*)(?P<version_split>[_,\-a-zA-Z]*)(?P<version>[0-9]*))))[\.](?P<ext>[a-zA-Z0-9]*))")
s = r"T:\Grimm\Grimm_EPS321\Comps\Fusion\G321_08_010\G321_08_010_v02.comp"
mat = pat.match(s)
result = []
for i in range(1, pat.groups + 1):  # pat.groups is the group count, so include it
    name = list(pat.groupindex.keys())[list(pat.groupindex.values()).index(i)]
    cap = mat.group(i)  # "mat" is the match object from above
    result.append([name, cap])
That will give you a list of lists, each inner list having the group name as its first item and the captured text as its second item.
Or if you want 2 lists, you can make something like:
names = []
captures = []
for i in range(1, pat.groups + 1):
    name = list(pat.groupindex.keys())[list(pat.groupindex.values()).index(i)]
    cap = mat.group(i)
    names.append(name)
    captures.append(cap)
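If you want the mapping itself in capture order, you can also invert groupindex once and build the dict directly (a sketch; it relies on dicts preserving insertion order, i.e. Python 3.7+, and on every group in the pattern being named, which is the case here):

# map group index -> group name once, then walk the groups in numeric order
index_to_name = {v: k for k, v in pat.groupindex.items()}
ordered = {index_to_name[i]: mat.group(i) for i in range(1, pat.groups + 1)}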
Getting key from value in a dict obtained from this answer
Basically I want to turn a string into an identifier for an object, like so:
count = 0
for i in range(50):
    count += 1
    functionToMakeIdentifier("foo" + str(count)) = Object(init_variable)
I want to make a series of objects with names like foo1, foo2, foo3, foo4, foo5, etc... But I don't know how to turn those strings into identifiers for the objects. Help!
You don't. You use an array (a.k.a. a list in Python), or a dictionary if you want or need something fancier than consecutive integers (e.g. strings) to identify the individual items.
For example:
foos = []
count = 0
for i in range(50):
    count += 1
    foos.append(Object(init_variable))
Afterwards, you can refer to the first foo as foos[0] and the 50th foo as foos[49] (indices start at 0; it seems weird at first, but it works at least as well as any convention once everybody agrees on it, and Python encourages 0-based indices, e.g. range counts from 0).
Also, your code can be simplified further. If you just want to generate a list of Object instances, you can use a list comprehension (it will probably take a while until your class or book or tutorial covers this...). Also, in your specific example, count is just i + 1, so the two can be merged (and when you want to count along something you iterate, like for item in items: ..., you can use for count, item in enumerate(items)).
Do this instead:
foos = [Object(init_variable) for _ in range(50)]
print(foos[0]) # first item
You now have a list of 50 Object items.
Maybe, if you really want to use these 'foo1' strings, you could do
foo_dict = {'foo{}'.format(i): Object(init_variable) for i in range(1, 51)}
print(foo_dict['foo1'])
Your use case is absurd, as one would use an array for that... but supposing you have a real need for it, the trick is using
globals()[name] = value
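For example (a sketch; doing this is almost always worse than a list or dict):

# dynamically creating module-level names foo1..foo3
for i in range(1, 4):
    globals()["foo%d" % i] = i * 10
print(foo2)  # 20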
Are you looking for a dictionary? You can do this:
foos = {}
for i in range(1, 51):  # 1..50 to get the foo1..foo50 naming
    foos['foo%d' % i] = MyObject()
and you get a dict with keys like ['foo1', 'foo2', ...] to access your objects.
Use the exec function:
>>> count = 0
>>> for i in range(50):
...     count += 1
...     exec("foo%d = %d" % (count, count))
...
>>> foo1
1
>>> foo2
2
>>> foo3
3
Hi,
I need to filter out all rows that don't contain symbols from a huge "necessary" list; example code:
def any_it(iterable):
    for element in iterable:
        if element: return True
    return False
regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ...] # huge list of 10 000 members
f = open("huge_file", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()
## File rows like, let's say:
# 1 djhds fruit=REDSOMETHING sdkjld
# 2 sdhfkjk fruit=GREENORANGE lkjfldk
# 3 dskjldsj fruit=YELLOWDOG sldkfjsdl
# 4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg
filtered = (line for line in lines if any_it(regexp.findall(line)[0].startswith(x) for x in necessary))
I have python 2.4, so I can't use built-in any().
The filtering takes a long time; is there some way to optimize it? For example, rows 1 and 4 contain the "RED..." pattern; once we have found that the "RED..." pattern is OK, can we skip the search through the 10,000-member list when row 4 has the same pattern?
Is there some another way to optimize filtering?
Thank you.
...edited...
UPD: See real example data in the comments to this post. I'm also interested in sorting the result by "fruits". Thanks!
...end edited...
If you organize the necessary list as a trie, you can check whether the fruit starts with a valid prefix by walking the trie. That should be faster than comparing the fruit against every prefix.
For example (only mildly tested):
import bisect
import re
class Node(object):
    def __init__(self):
        self.children = []
        self.children_values = []
        self.exists = False

    # Based on code at http://docs.python.org/library/bisect.html
    def _index_of(self, ch):
        i = bisect.bisect_left(self.children_values, ch)
        if i != len(self.children_values) and self.children_values[i] == ch:
            return (i, self.children[i])
        return (i, None)

    def add(self, value):
        if len(value) == 0:
            self.exists = True
            return
        i, child = self._index_of(value[0])
        if not child:
            child = Node()
            self.children.insert(i, child)
            self.children_values.insert(i, value[0])
        child.add(value[1:])

    def contains_prefix_of(self, value):
        if self.exists:
            return True
        if not value:  # ran out of characters without completing any stored prefix
            return False
        i, child = self._index_of(value[0])
        if not child:
            return False
        return child.contains_prefix_of(value[1:])

necessary = ['RED', 'GREEN', 'BLUE', 'ORANGE', 'BLACK',
             'LIGHTRED', 'LIGHTGREEN', 'GRAY']

trie = Node()
for value in necessary:
    trie.add(value)

# Find lines that match values in the trie
filtered = []
regexp = re.compile(r'fruit=([A-Z]+)')
for line in open('whatever-file'):
    fruit = regexp.findall(line)[0]
    if trie.contains_prefix_of(fruit):
        filtered.append(line)
This changes your algorithm from O(N * k), where N is the number of elements of necessary and k is the length of fruit, to just O(k) (more or less). It does take more memory though, but that might be a worthwhile trade-off for your case.
I'm convinced Zach's answer is on the right track. Out of curiosity, I've implemented another version (incorporating Zach's comments about using a dict instead of bisect) and folded it into a solution that matches your example.
#!/usr/bin/env python
import re
from trieMatch import PrefixMatch # https://gist.github.com/736416
pm = PrefixMatch(['YELLOW', 'GREEN', 'RED', ]) # huge list of 10 000 members
# if the list is static, it might be worth pickling "pm" to avoid rebuilding it each time
f = open("huge_file.txt", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()
regexp = re.compile(r'^.*?fruit=([A-Z]+)')
filtered = (line for line in lines if pm.match(regexp.match(line).group(1)))
For brevity, the implementation of PrefixMatch is published here.
If your list of necessary prefixes is static or changes infrequently, you can speed up subsequent runs by pickling and reusing the PrefixMatch object instead of rebuilding it each time; a sketch follows.
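A sketch of that idea (assuming PrefixMatch from the gist pickles cleanly; the cache file name is made up):

import cPickle as pickle  # Python 2-era pickle module

try:
    pm = pickle.load(open("prefixmatch.pkl", "rb"))
except IOError:  # no cache yet: build and save it
    pm = PrefixMatch(['YELLOW', 'GREEN', 'RED', ])
    pickle.dump(pm, open("prefixmatch.pkl", "wb"), -1)  # -1: highest protocol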
update (on sorted results)
According to the changelog for Python 2.4:
key should be a single-parameter function that takes a list element and returns a comparison key for the element. The list is then sorted using the comparison keys.
also, in the source code, line 1792:
/* Special wrapper to support stable sorting using the decorate-sort-undecorate
pattern. Holds a key which is used for comparisons and the original record
which is returned during the undecorate phase. By exposing only the key
.... */
This means that your regex pattern is only evaluated once for each entry (not once per comparison), hence it should not be too expensive to do:
sorted_generator = sorted(filtered, key=lambda line: regexp.match(line).group(1))
I personally like your code as-is, since you treat "fruit=COLOR" as a pattern, which the others do not. I think you want some memoization-like solution that lets you skip the test for an already-solved case, but I guess that does not apply here.
from itertools import ifilter  # Python 2 lazy filter

def any_it(iterable):
    for element in iterable:
        if element: return True
    return False

necessary = ['YELLOW', 'GREEN', 'RED', ...]
predicate = lambda line: any_it(("fruit=" + color) in line for color in necessary)
filtered = ifilter(predicate, open("testest"))
Tested (but unbenchmarked) code:
import re
import fileinput
regexp = re.compile(r'^.*?fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ]
filtered = []
for line in fileinput.input(["test.txt"]):
    try:
        key = regexp.match(line).group(1)
    except AttributeError:
        continue  # no match
    for p in necessary:
        if key.startswith(p):
            filtered.append(line)
            break
# "filtered" now holds your results
print "".join(filtered)
Differences from the code in the question:
We do not load the whole file into memory first (as happens when you use file.readlines()). Instead, we process each line as the file is read. I use the fileinput module here for brevity, but one can also use line = file.readline() and a while line: loop (see the sketch after this list).
We stop iterating through the necessary list once a match is found.
We modified the regex pattern and use re.match instead of re.findall, assuming that each line contains only one "fruit=..." entry.
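The readline variant mentioned in the first point would look roughly like this (a sketch; the processing body is the same as in the fileinput loop above):

f = open("test.txt", "r")
line = f.readline()
while line:
    # ... same try/match/startswith body as above ...
    line = f.readline()
f.close()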
update
If the format of the input file is consistent, you can squeeze out a little more performance by getting rid of regex altogether.
try:
    # with line = "2 asdasd fruit=SOMETHING asdasd...."
    key = line.split(" ", 3)[2].split("=")[1]
except IndexError:
    continue  # no match
filtered = []
for line in open('huge_file'):
    found = regexp.findall(line)
    if found:
        fruit = found[0]
        for x in necessary:
            if fruit.startswith(x):
                filtered.append(line)
                break
Or maybe:
necessary = ['fruit=%s' % x for x in necessary]

filtered = []
for line in open('huge_file'):
    for x in necessary:
        if x in line:
            filtered.append(line)
            break
I'd make a simple list like ['fruit=RED', 'fruit=GREEN', ...] with ['fruit=' + n for n in necessary], then use in rather than a regex to test them. I don't think there's any way to do it really quickly, though.
filtered = (line for line in f if any(a in line for a in necessary_simple))
(The any() function does the same thing as your any_it() function.)
Oh, and get rid of file.readlines(), just iterate over the file.
Untested code:
filtered = []
for line in lines:
    value = line.split('=', 1)[1].split(' ', 1)[0]
    if value in necessary:  # the original had "not in", which kept the wrong lines; note this tests exact values, not prefixes
        filtered.append(line)
That should be faster than pattern matching 10 000 patterns onto a line.
Possibly there are even faster ways. :)
It shouldn't take too long to iterate through 100,000 strings, but I see you have a 10,000-string list, which means up to 10,000 * 100,000 = 1,000,000,000 string comparisons, so I don't know what you expected...
As for your question: if you encounter a word from the list and you only need one or more matches (if you want exactly one, you need to iterate through the whole list), you can skip the rest; that should optimize the search operation.