Hashtable/dictionary/map lookup with regular expressions - python

I'm trying to figure out if there's a reasonably efficient way to perform a lookup in a dictionary (or a hash, or a map, or whatever your favorite language calls it) where the keys are regular expressions and strings are looked up against the set of keys. For example (in Python syntax):
>>> regex_dict = { re.compile(r'foo.') : 12, re.compile(r'^FileN.*$') : 35 }
>>> regex_dict['food']
12
>>> regex_dict['foot in my mouth']
12
>>> regex_dict['FileNotFoundException: file.x does not exist']
35
(Obviously the above example won't work as written in Python, but that's the sort of thing I'd like to be able to do.)
I can think of a naive way to implement this, in which I iterate over all of the keys in the dictionary and try to match the passed-in string against them, but then I lose the O(1) lookup time of a hash map and instead have O(n), where n is the number of keys in my dictionary. This is potentially a big deal, as I expect this dictionary to grow very large, and I will need to search it over and over again (actually I'll need to iterate over it for every line I read in a text file, and the files can be hundreds of megabytes in size).
Is there a way to accomplish this, without resorting to O(n) efficiency?
Alternatively, if you know of a way to accomplish this sort of a lookup in a database, that would be great, too.
(Any programming language is fine -- I'm using Python, but I'm more interested in the data structures and algorithms here.)
Someone pointed out that more than one match is possible, and that's absolutely correct. Ideally in this situation I'd like to return a list or tuple containing all of the matches. I'd settle for the first match, though.
I can't see O(1) being possible in that scenario; I'd settle for anything less than O(n), though. Also, the underlying data structure could be anything, but the basic behavior I'd like is what I've written above: lookup a string, and return the value(s) that match the regular expression keys.
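For reference, a minimal sketch of the naive O(n) approach I described above (the RegexDict name is just for illustration, not a real library):
import re

class RegexDict(object):
    """Naive baseline: try every compiled pattern against the looked-up string."""
    def __init__(self, mapping):
        self._items = list(mapping.items())   # [(compiled_regex, value), ...]

    def __getitem__(self, text):
        matches = [value for pattern, value in self._items if pattern.match(text)]
        if not matches:
            raise KeyError(text)
        return matches   # all matching values; O(n) in the number of patterns

regex_dict = RegexDict({re.compile(r'foo.'): 12, re.compile(r'^FileN.*$'): 35})
print(regex_dict['food'])                                          # [12]
print(regex_dict['FileNotFoundException: file.x does not exist'])  # [35]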

What you want to do is very similar to what is supported by xrdb. They only support a fairly minimal notion of globbing, however.
Internally you can implement a larger family of regular languages than theirs by storing your regular expressions as a character trie.
Single characters just become trie nodes.
A '.' becomes a wildcard insertion covering all children of the current trie node.
A '*' becomes a back link in the trie to the node at the start of the previous item.
A range like [a-z] inserts the same subsequent child nodes repeatedly under each character in the range. With care, while inserts/updates may be somewhat expensive, the search can be linear in the size of the string. With some placeholder tricks the common combinatorial-explosion cases can be kept under control.
An alternation like (foo)|(bar) becomes multiple insertions.
This doesn't handle regexes that occur at arbitrary points in the string, but that can be modeled by wrapping your regex with .* on either side.
Perl has a couple of Text::Trie-like modules you can raid for ideas. (Heck, I think I even wrote one of them way back when.)
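A minimal Python sketch of the trie idea above, handling only literal characters and the '.' wildcard (no '*', ranges, or alternation); the RegexTrie name and layout are illustrative, not an existing library:
class RegexTrie(object):
    """Toy trie for whole-string matching of patterns made of literals and '.' wildcards."""
    WILDCARD = object()   # special edge label used for '.'

    def __init__(self):
        self.root = {}    # node = {edge_label: child_node, ...}; '__values__' marks accept states

    def add(self, pattern, value):
        node = self.root
        for ch in pattern:
            label = self.WILDCARD if ch == '.' else ch
            node = node.setdefault(label, {})
        node.setdefault('__values__', []).append(value)

    def lookup(self, text):
        """Return all values whose pattern matches the whole of text."""
        nodes = [self.root]
        for ch in text:
            next_nodes = []
            for node in nodes:
                if ch in node:
                    next_nodes.append(node[ch])
                if self.WILDCARD in node:
                    next_nodes.append(node[self.WILDCARD])
            nodes = next_nodes
            if not nodes:
                return []
        results = []
        for node in nodes:
            results.extend(node.get('__values__', []))
        return results

trie = RegexTrie()
trie.add('foo.', 12)
trie.add('f..l', 7)
print(trie.lookup('food'))   # [12]
print(trie.lookup('fool'))   # [12, 7]
print(trie.lookup('bar'))    # []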

This is not possible to do with a regular hash table in any language. You'll either have to iterate through the entire keyset, attempting to match the key to your regex, or use a different data structure.
You should choose a data structure that is appropriate to the problem you're trying to solve. If you have to match against any arbitrary regular expression, I don't know of a good solution. If the class of regular expressions you'll be using is more restrictive, you might be able to use a data structure such as a trie or suffix tree.

In the general case, what you need is a lexer generator. It takes a bunch of regular expressions and compiles them into a recognizer. "lex" will work if you are using C. I have never used a lexer generator in Python, but there seem to be a few to choose from. Google shows PLY, PyGgy and PyLexer.
If the regular expressions all resemble each other in some way, then you may be able to take some shortcuts. We would need to know more about the ultimate problem that you are trying to solve in order to come up with any suggestions. Can you share some sample regular expressions and some sample data?
Also, how many regular expressions are you dealing with here? Are you sure that the naive approach won't work? As Rob Pike once said, "Fancy algorithms are slow when n is small, and n is usually small." Unless you have thousands of regular expressions, and thousands of things to match against them, and this is an interactive application where a user is waiting for you, you may be best off just doing it the easy way and looping through the regular expressions.

This is definitely possible, as long as you're using 'real' regular expressions. A textbook regular expression is something that can be recognized by a deterministic finite state machine, which primarily means you can't have back-references in there.
There's a property of regular languages that "the union of two regular languages is regular", meaning that you can recognize an arbitrary number of regular expressions at once with a single state machine. The state machine runs in O(1) time with respect to the number of expressions (it runs in O(n) time with respect to the length of the input string, but hash tables do too).
Once the state machine completes you'll know which expressions matched, and from there it's easy to look up values in O(1) time.
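As a rough sketch of that idea using only the standard re module: union the patterns into one alternation with a named group per key, and use lastgroup to recover which key matched. (Python's re is a backtracking engine rather than a true DFA, so the O(1)-in-the-number-of-expressions guarantee doesn't strictly hold, and this assumes the key patterns define no named groups of their own; the names below are illustrative.)
import re

patterns = {'food_like': r'foo.', 'file_not_found': r'^FileN.*$'}
values = {'food_like': 12, 'file_not_found': 35}

# One combined pattern; each alternative gets a named group matching its key.
combined = re.compile('|'.join('(?P<%s>%s)' % (name, pat) for name, pat in patterns.items()))

def lookup(text):
    m = combined.match(text)
    if m is None:
        raise KeyError(text)
    return values[m.lastgroup]   # lastgroup names the alternative that matched

print(lookup('food'))                                          # 12
print(lookup('FileNotFoundException: file.x does not exist'))  # 35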

What happens if you have a dictionary such as
regex_dict = { re.compile("foo.*"): 5, re.compile("f.*"): 6 }
In this case regex_dict["food"] could legitimately return either 5 or 6.
Even ignoring that problem, there's probably no way to do this efficiently with the regex module. Instead, what you'd need is an internal directed graph or tree structure.

What about the following:
import re

class redict(dict):
    def __init__(self, d):
        dict.__init__(self, d)

    def __getitem__(self, regex):
        r = re.compile(regex)
        mkeys = filter(r.match, self.keys())
        for i in mkeys:
            yield dict.__getitem__(self, i)
It's basically a subclass of the dict type in Python. With this you can supply a regular expression as a key, and the values of all keys that match this regex are returned in an iterable fashion using yield.
With this you can do the following:
>>> keys = ["a", "b", "c", "ab", "ce", "de"]
>>> vals = range(0,len(keys))
>>> red = redict(zip(keys, vals))
>>> for i in red[r"^.e$"]:
... print i
...
5
4
>>>

Here's an efficient way to do it by combining the keys into a single compiled regexp, and so not requiring any looping over key patterns. It abuses the lastindex to find out which key matched. (It's a shame regexp libraries don't let you tag the terminal state of the DFA that a regexp is compiled to, or this would be less of a hack.)
The expression is compiled once, and will produce a fast matcher that doesn't have to search sequentially. Common prefixes are compiled together in the DFA, so each character in the key is matched once, not many times, unlike some of the other suggested solutions. You're effectively compiling a mini lexer for your keyspace.
This map isn't extensible (can't define new keys) without recompiling the regexp, but it can be handy for some situations.
# Regular expression map
# Abuses match.lastindex to figure out which key was matched
# (i.e. to emulate extracting the terminal state of the DFA of the regexp engine)
# Mostly for amusement.
# Richard Brooksby, Ravenbrook Limited, 2013-06-01

import re

class ReMap(object):

    def __init__(self, items):
        if not items:
            items = [(r'epsilon^', None)] # Match nothing
        key_patterns = []
        self.lookup = {}
        index = 1
        for key, value in items:
            # Ensure there are no capturing parens in the key, because
            # that would mess up match.lastindex
            key_patterns.append('(' + re.sub(r'\((?!\?:)', '(?:', key) + ')')
            self.lookup[index] = value
            index += 1
        self.keys_re = re.compile('|'.join(key_patterns))

    def __getitem__(self, key):
        m = self.keys_re.match(key)
        if m:
            return self.lookup[m.lastindex]
        raise KeyError(key)

if __name__ == '__main__':
    remap = ReMap([(r'foo.', 12), (r'FileN.*', 35)])
    print remap['food']
    print remap['foot in my mouth']
    print remap['FileNotFoundException: file.x does not exist']

There is a Perl module that does just this: Tie::Hash::Regex.
use Tie::Hash::Regex;
my %h;
tie %h, 'Tie::Hash::Regex';
$h{key} = 'value';
$h{key2} = 'another value';
$h{stuff} = 'something else';
print $h{key}; # prints 'value'
print $h{2}; # prints 'another value'
print $h{'^s'}; # prints 'something else'
print tied(%h)->FETCH(k); # prints 'value' and 'another value'
delete $h{k}; # deletes $h{key} and $h{key2};

@rptb1: you don't have to avoid capturing groups, because you can use the compiled pattern's groups attribute to count them. Like this:
# Regular expression map
# Abuses match.lastindex to figure out which key was matched
# (i.e. to emulate extracting the terminal state of the DFA of the regexp engine)
# Mostly for amusement.
# Richard Brooksby, Ravenbrook Limited, 2013-06-01

import re

class ReMap(object):

    def __init__(self, items):
        if not items:
            items = [(r'epsilon^', None)] # Match nothing
        self.re = re.compile('|'.join('('+k+')' for (k,v) in items))
        self.lookup = {}
        index = 1
        for key, value in items:
            self.lookup[index] = value
            index += re.compile(key).groups + 1

    def __getitem__(self, key):
        m = self.re.match(key)
        if m:
            return self.lookup[m.lastindex]
        raise KeyError(key)

def test():
    remap = ReMap([(r'foo.', 12),
                   (r'.*([0-9]+)', 99),
                   (r'FileN.*', 35),
                   ])
    print remap['food']
    print remap['foot in my mouth']
    print remap['FileNotFoundException: file.x does not exist']
    print remap['there were 99 trombones']
    print remap['food costs $18']
    print remap['bar']

if __name__ == '__main__':
    test()
Sadly very few RE engines actually compile the regexps down to machine code, although it's not especially hard to do. I suspect there's an order of magnitude performance improvement waiting for someone to make a really good RE JIT library.

As other respondents have pointed out, it's not possible to do this with a hash table in constant time.
One approximation that might help is to use a technique called "n-grams". Create an inverted index from n-character chunks of a word to the entire word. When given a pattern, split it into n-character chunks, and use the index to compute a scored list of matching words.
Even if you can't accept an approximation, in most cases this would still provide an accurate filtering mechanism so that you don't have to apply the regex to every key.
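A minimal sketch of such an n-gram index (trigrams here), used to score candidates before any expensive matching; the word list and names are illustrative:
from collections import defaultdict

def ngrams(s, n=3):
    return {s[i:i+n] for i in range(len(s) - n + 1)}

# Inverted index: n-gram -> set of words containing it.
words = ['FileNotFound', 'FileExists', 'foobar', 'food']
index = defaultdict(set)
for w in words:
    for g in ngrams(w):
        index[g].add(w)

def candidates(pattern_literal):
    """Score words by how many of the pattern's n-grams they contain."""
    scores = defaultdict(int)
    for g in ngrams(pattern_literal):
        for w in index[g]:
            scores[w] += 1
    return sorted(scores, key=scores.get, reverse=True)

# Only the top candidates would then be checked with the real regex.
print(candidates('FileN'))   # FileNotFound scores highest, then FileExists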

A special case of this problem came up in the 1970s in AI languages oriented around deductive databases. The keys in these databases could be patterns with variables -- like regular expressions without the * or | operators. They tended to use fancy extensions of trie structures for indexes. See krep*.lisp in Norvig's Paradigms of AI Programming for the general idea.

If you have a small set of possible inputs, you can cache the matches as they appear in a second dict and get O(1) for the cached values.
If the set of possible inputs is too big to cache but not infinite either, you can keep the last N matches in the cache (check Google for "LRU maps" - least recently used).
If you can't do this, you can try to chop down the number of regexps you have to try by checking a prefix or somesuch.
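A sketch of the caching idea using functools.lru_cache to keep the most recent results; the patterns list and scan_patterns name are illustrative:
import re
from functools import lru_cache

patterns = [(re.compile(r'foo.'), 12), (re.compile(r'^FileN.*$'), 35)]

@lru_cache(maxsize=100000)          # keeps the most recent distinct inputs
def scan_patterns(text):
    """O(n) scan over the patterns, but repeated inputs hit the cache in O(1)."""
    return tuple(value for pattern, value in patterns if pattern.match(text))

print(scan_patterns('food'))          # (12,)
print(scan_patterns('food'))          # served from the cache this time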

I created this exact data structure for a project once. I implemented it naively, as you suggested. I did make two immensely helpful optimizations, which may or may not be feasible for you, depending on the size of your data:
Memoizing the hash lookups
Pre-seeding the memoization table (not sure what to call this... warming up the cache?)
To avoid the problem of multiple keys matching the input, I gave each regex key a priority and the highest priority was used.

The fundamental assumption is flawed, I think: you can't map hashes to regular expressions.

I don't think it's even theoretically possible. What happens if someone passes in a string that matches more than one regular expression?
For example, what would happen if someone did:
>>> regex_dict['FileNfoo']
How can something like that possibly be O(1)?

It may be possible to get the regex compiler to do most of the work for you by concatenating the search expressions into one big regexp, separated by "|". A clever regex compiler might search for commonalities in the alternatives in such a case, and devise a more efficient search strategy than simply checking each one in turn. But I have no idea whether there are compilers which will do that.

It really depends on what these regexes look like. If you don't have many regexes that match almost anything, like '.*' or '\d+', and instead have regexes that mostly contain words, phrases, or other fixed patterns longer than 4 characters (e.g. 'a*b*c' in ^\d+a\*b\*c:\s+\w+), as in your examples, then you can use a common trick that scales well to millions of regexes:
Build an inverted index for the regexes (rabin-karp-hash('fixed pattern') -> list of regexes containing 'fixed pattern'). At matching time, use Rabin-Karp hashing to compute sliding hashes and look up the inverted index, advancing one character at a time. You now have O(1) lookup for inverted-index non-matches and a reasonable O(k) time for matches, where k is the average length of the lists of regexes in the inverted index. k can be quite small (less than 10) for many applications. The quality of the inverted index (a false positive means a bigger k; a false negative means a missed match) depends on how well the indexer understands the regex syntax. If the regexes are written by human experts, they can provide hints about the contained fixed patterns as well.
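A very rough sketch of that idea, simplified so that each regex supplies one hand-picked fixed literal and a plain substring test stands in for the Rabin-Karp rolling hash; all names here are illustrative:
import re
from collections import defaultdict

# (hand-picked fixed literal contained in the regex, compiled regex, value)
rules = [
    ('foo',   re.compile(r'foo.'),    12),
    ('FileN', re.compile(r'FileN.*'), 35),
]

# Inverted index: fixed literal -> rules whose regex contains that literal.
index = defaultdict(list)
for literal, pattern, value in rules:
    index[literal].append((pattern, value))

def lookup(text):
    results = []
    for literal, candidates in index.items():
        if literal in text:                 # stand-in for the rolling-hash scan
            for pattern, value in candidates:
                if pattern.search(text):    # confirm with the real regex
                    results.append(value)
    return results

print(lookup('FileNotFoundException: file.x does not exist'))  # [35]
print(lookup('some food'))                                     # [12]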

OK, I have a very similar requirement. I have many lines of different syntax, basically remark lines and lines with codes to be used in a smart-card formatting process, as well as descriptor lines for keys and secret codes. In every case, I think the pattern/action model is the best approach for recognizing and processing many lines.
I'm using C++/CLI to develop my assembly, named LanguageProcessor.dll. The core of this library is a lex_rule class that basically contains:
a Regex member
an event member
The constructor loads the regex string and calls the code needed to build the event on the fly using DynamicMethod, Emit and Reflection. The assembly also contains other classes, such as meta and object, that construct and instantiate objects from the simple names of the publisher and receiver classes; the receiver class provides the action handlers for each matched rule.
Later, I added a class named fasterlex_engine that builds a Dictionary<Regex, action_delegate> and loads the definitions from an array in order to run.
The project is at an advanced point, but I'm still building it today. I will try to improve the runtime performance by replacing the sequential pass over every pair for each input line with some mechanism that looks up the dictionary directly by regexp, like:
map_rule[gcnew Regex("[a-zA-Z]")];
Here are some segments of my code:
public ref class lex_rule: ILexRule
{
private:
    Exception ^m_exception;
    Regex ^m_pattern;

    // BACKSTORAGE delegates (I learned this the hard way messing with .NET, hehe)
    yy_lexical_action ^m_yy_lexical_action;
    yy_user_action ^m_yy_user_action;

public:
    virtual property String ^short_id;

private:
    void init(String ^_short_id, String ^well_formed_regex);

public:
    lex_rule();
    lex_rule(String ^_short_id,String ^well_formed_regex);

    virtual event yy_lexical_action ^YY_RULE_MATCHED
    {
        virtual void add(yy_lexical_action ^_delegateHandle)
        {
            if(nullptr==m_yy_lexical_action)
                m_yy_lexical_action=_delegateHandle;
        }
        virtual void remove(yy_lexical_action ^)
        {
            m_yy_lexical_action=nullptr;
        }

        virtual long raise(String ^id_rule, String ^input_string, String ^match_string, int index)
        {
            long lReturn=-1L;
            if(m_yy_lexical_action)
                lReturn=m_yy_lexical_action(id_rule,input_string, match_string, index);
            return lReturn;
        }
    }
};
Now, the fasterlex_engine class, which executes many pattern/action pairs:
public ref class fasterlex_engine
{
private:
    Dictionary<String^,ILexRule^> ^m_map_rules;
public:
    fasterlex_engine();
    fasterlex_engine(array<String ^,2>^defs);
    Dictionary<String ^,Exception ^> ^load_definitions(array<String ^,2> ^defs);
    void run();
};
And to round out this topic, some code from my .cpp file:
This code creates a constructor invoker by parameter signature:
inline Exception ^object::builder(ConstructorInfo ^target, array<Type^> ^args)
{
    try
    {
        DynamicMethod ^dm=gcnew DynamicMethod(
            "dyna_method_by_totem_motorist",
            Object::typeid,
            args,
            target->DeclaringType);
        ILGenerator ^il=dm->GetILGenerator();
        il->Emit(OpCodes::Ldarg_0);
        il->Emit(OpCodes::Call,Object::typeid->GetConstructor(Type::EmptyTypes)); // invoke the base constructor
        il->Emit(OpCodes::Ldarg_0);
        il->Emit(OpCodes::Ldarg_1);
        il->Emit(OpCodes::Newobj, target); // Newobj creates the object and invokes the constructor defined in target
        il->Emit(OpCodes::Ret);
        method_handler=(method_invoker ^) dm->CreateDelegate(method_invoker::typeid);
    }
    catch (Exception ^e)
    {
        return e;
    }
    return nullptr;
}
This code attaches any handler function (static or not) to deal with a callback raised by the matching of an input string:
Delegate ^connection_point::hook(String ^receiver_namespace,String ^receiver_class_name, String ^handler_name)
{
    Delegate ^d=nullptr;
    if(connection_point::waitfor_hook<=m_state) // if the state is 0, 1, 2 or more => try to hook
    {
        try
        {
            Type ^tmp=meta::_class(receiver_namespace+"."+receiver_class_name);
            m_handler=tmp->GetMethod(handler_name);
            m_receiver_object=Activator::CreateInstance(tmp,false);
            d=m_handler->IsStatic?
                Delegate::CreateDelegate(m_tdelegate,m_handler):
                Delegate::CreateDelegate(m_tdelegate,m_receiver_object,m_handler);
            m_add_handler=m_connection_point->GetAddMethod();
            array<Object^> ^add_handler_args={d};
            m_add_handler->Invoke(m_publisher_object, add_handler_args);
            ++m_state;
            m_exception_flag=false;
        }
        catch(Exception ^e)
        {
            m_exception_flag=true;
            throw gcnew Exception(e->ToString()) ;
        }
    }
    return d;
}
Finally, the code that calls the lexer engine:
array<String ^,2> ^defs=gcnew array<String^,2> { /* shortID   pattern       namespace  class        function */
    {"LETRAS", "[A-Za-z]+" ,"prueba", "manejador", "procesa_directriz"},
    {"INTS",   "[0-9]+"    ,"prueba", "manejador", "procesa_comentario"},
    {"REM",    "--[^\\n]*" ,"prueba", "manejador", "nullptr"}
}; // [3,5]
// I use the special identifier "nullptr" so the system assigns the event processing to a default handler that does nothing
fasterlex_engine ^lex=gcnew fasterlex_engine();
Dictionary<String ^,Exception ^> ^map_error_list=lex->load_definitions(defs);
lex->run();

The problem has nothing to do with regular expressions: you'd have the same problem with a dictionary whose keys are lambdas or other functions. The problem you face is figuring out whether there is a way of classifying your functions to determine which will return true, and that isn't a search problem, because f(x) is not known in general beforehand.
Distributed programming, or caching answer sets on the assumption that there are common values of x, may help.
-- DM

Related

Fast String "Startswith" Matching for Dict like object

I currently have some code which needs to be very performant, where I am essentially doing a string dictionary key lookup:
class Foo:
    def __init__(self):
        self.fast_lookup = {"a": 1, "b": 2}

    def bar(self, s):
        return self.fast_lookup[s]
self.fast_lookup has O(1) lookup time, and there is no try/if etc code that would slow down the lookup
Is there any way to retain this speed while doing a "startswith" lookup instead? With the code above, calling bar with s="az" would result in a KeyError; if it were changed to a "startswith" implementation, it would return 1.
NB: I am well aware of how I could do this with a regex/startswith statement; I am looking specifically at the performance of a startswith dict lookup.
An efficient way to do this would be to use the pyahocorasick module to construct a trie with the possible keys to match, then use the longest_prefix method to determine how much of a given string matches. If no "key" matched, it returns 0, otherwise it will say how much of the string passed exists in the automata.
After installing pyahocorasick, it would look something like:
import ahocorasick

class Foo:
    def __init__(self):
        self.fast_lookup = ahocorasick.Automaton()
        for k, v in {"a": 1, "b": 2}.items():
            self.fast_lookup.add_word(k, v)

    def bar(self, s):
        index = self.fast_lookup.longest_prefix(s)
        if not index: # No prefix match at all
            raise KeyError(s)
        return self.fast_lookup.get(s[:index])
If it turns out the longest prefix doesn't actually map to a value (say, 'cat' is mapped, but you're looking up 'cab', and no other entry actually maps 'ca' or 'cab'), this will die with a KeyError. Tweak as needed to achieve precise behavior desired (you might need to use longest_prefix as a starting point and try to .get() for all substrings of that length or less until you get a hit for instance).
Note that this isn't the primary purpose of Aho-Corasick (it's an efficient way to search for many fixed strings in one or more long strings in a single pass), but tries as a whole are an efficient way to deal with prefix search of this form, and Aho-Corasick is implemented in terms of tries and provides most of the useful features of tries to make it more broadly useful (as in this case).
I don't fully understand the question, but what I would do is try to think of ways to reduce the work the lookup even has to do. If you know the basic searches the startswith is going to do, you can just add those as keys to the dictionary, with values that point to the same object. Your dict will get pretty big pretty fast, but I believe it will greatly reduce the lookup cost. For a more dynamic method, you could add dict keys for the first groups of letters, up to three, for each entry.
Without actively storing the references for each search, your code will always need to check each dict key until it finds one that matches. You cannot reduce that.
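A sketch of that idea: keep the original table and memoize each query string the first time it is seen, so repeat lookups stay plain dict hits (the PrefixLookup name and the slow fallback are illustrative):
class PrefixLookup(object):
    def __init__(self, table):
        self.table = table          # e.g. {"a": 1, "b": 2}
        self.memo = dict(table)     # expanded with full query strings as they appear

    def _slow_startswith(self, s):
        for key, value in self.table.items():
            if s.startswith(key):
                return value
        raise KeyError(s)

    def bar(self, s):
        try:
            return self.memo[s]             # plain O(1) dict hit on repeats
        except KeyError:
            value = self._slow_startswith(s)
            self.memo[s] = value            # "add those as keys to the dictionary"
            return value

lookup = PrefixLookup({"a": 1, "b": 2})
print(lookup.bar("az"))   # 1, via the slow path the first time
print(lookup.bar("az"))   # 1, now a direct dict hit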

python rply reverse parser

I'm using rply and Python 3.6 to create a lexer and a parser for a little private project.
But what I noticed is that the parser appears to flip the order of the lexer stream.
This is the file I'm parsing:
let test:string = "test";
print(test);
Lexer output:
Token('LET', 'let')
Token('NAME', 'test')
Token('COLON', ':')
Token('NAME', 'string')
Token('EQUALS', '=')
Token('STRING', '"test"')
Token('SEMI_COLON', ';')
Token('PRINT', 'print')
Token('OPEN_PARENS', '(')
Token('STRING', '"test"')
Token('CLOSE_PARENS', ')')
Token('SEMI_COLON', ';')
As you can see it is in the order of the script.
I use the parser to create a variable with name test, type string and value test. Then I want to print the variable.
It does create the variable but when I want to print it out, there is nothing.
But when I flip the script like this
print(test);
let test:string = "test";
it is able to print the value correctly.
The two parser 'rules' look like this:
Print:
@self.pg.production('expression : PRINT OPEN_PARENS expression CLOSE_PARENS SEMI_COLON expression')
def print_s(p):
    ...
Create variable:
@self.pg.production('expression : LET expression COLON expression EQUALS expression SEMI_COLON expression')
def create_var(p):
    ...
So my question is: How can I flip the order in which the content is parsed?
Edit: I looked for similar questions or problems and also in the documentation but didn't find anything.
Here's a somewhat simpler example; hopefully, you can see the pattern.
The key insight is that reduction actions (that is, the parser functions) are executed when the production's match has been fully parsed. That means that if a production contains non-terminals, the actions for those non-terminals are executed before the action for the whole production.
It should be clear why this is true. Every production action depends on the semantic values of all of the components, and in the case of non-terminals those values are produced by running the corresponding actions.
Now, consider these two very similar ways to parse a list of things. In both cases, we assume there is a base production which recognises an empty list (list :) and does nothing.
Right recursion:
list : thing list
Left recursion:
list : list thing
In both cases, the action prints the thing, which is p[0] in the right-recursive case, and p[1] in the left-recursive one.
The right-recursive production will cause the things to be printed in reverse order, because printing the thing doesn't happen until after the internal list is parsed (and its components are printed).
But the left-recursive production will print the things in left-to-right order, for the same reason. The difference is that in the left-recursive case, the internal (recursive) list contains the initial things, while in the right-recursive case, the list contains the final things.
If you were just building a Python list of things, this probably wouldn't matter much, since execution order wouldn't be important. It's only visible in this example because the action has a side-effect (printing a value), which makes execution order visible.
There are other techniques to order actions, in the rare cases where it is really necessary. But best practice is to always use left-recursion whenever it is syntactically practical. Left-recursive parsers are more efficient because the parser doesn't need to accumulate a stack of incomplete productions. And left-recursion is often better for your actions as well.
Here, for example, the left-recursive action could append the new value (p[0].append(p[1]); return p[0]), while the right-recursive action needs to create a new list (return [p[0]] + p[1]). Since repeated appending is on average linear time while repeated concatenation is quadratic, the left-recursive parser is more scalable for large lists.
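To see this without any parser machinery, here is a plain-Python simulation (no rply involved) of the order in which the reduction actions fire for the two rule shapes:
things = ['a', 'b', 'c']

def right_recursive(items):
    # list : thing list  -- the inner list is reduced first, so the actions
    # fire from the last thing back to the first.
    if not items:
        return []
    rest = right_recursive(items[1:])   # reduce (and act on) the inner list first
    print('right-recursive action sees', items[0])
    return [items[0]] + rest

def left_recursive(items):
    # list : list thing  -- the inner list is everything to the left, so the
    # actions fire from the first thing to the last.
    if not items:
        return []
    acc = left_recursive(items[:-1])
    print('left-recursive action sees', items[-1])
    acc.append(items[-1])
    return acc

right_recursive(things)   # actions fire for c, b, a
left_recursive(things)    # actions fire for a, b, c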

C data structures

Is there a C data structure equivalent to the following Python structure?
data = {'X': 1, 'Y': 2}
Basically I want a structure where I can give it a pre-defined string and have it come out with an integer.
The data-structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.
A hash table is a mutable mapping of an integer (usually derived from a string) to another value, just like the dict from Python, which your sample code instantiates.
It's called a "hash table" because it performs a hash function on the string to return an integer result, and then directly uses that integer to point to the address of your desired data.
This system makes it extremely quick to access and change your information, even if you have tons of it. It also means that the data is unordered, because a hash function returns an effectively random result and scatters your data unpredictably all over the map (in a perfect world).
Also note that if you're doing a quick one-off hash, say a lookup over two or three static strings: look at gperf, which generates a perfect hash function and emits simple code for that hash.
The above data structure is a dict type.
In C/C++ parlance, a hashmap would be the equivalent; Google for a hashmap implementation.
There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.
If the data set will remain relatively small, the easiest solution is probably to just have an array of structures along the lines of:
typedef struct {
    char *key;
    int val;
} tElement;
then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:
def init:
    create g.key[100] as string
    create g.val[100] as integer
    set g.size to 0

def add (key,val):
    if lookup(key) != not_found:
        return already_exists
    if g.size == 100:
        return no_space
    g.key[g.size] = key
    g.val[g.size] = val
    g.size = g.size + 1
    return okay

def del (key):
    pos = lookup (key)
    if pos == not_found:
        return no_such_key
    if pos < g.size - 1:
        g.key[pos] = g.key[g.size-1]
        g.val[pos] = g.val[g.size-1]
    g.size = g.size - 1

def lookup (key):
    for pos goes from 0 to g.size-1:
        if g.key[pos] == key:
            return pos
    return not_found
Insertion means ensuring it doesn't already exist, then just tacking an element onto the end (you'll maintain a separate size variable for the structure). Deletion means finding the element, then simply overwriting it with the last used element and decrementing the size variable.
Now this isn't the most efficient method in the world but you need to keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient one wasn't available. That's because it's massively quick to code up and the performance is irrelevant.
Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.
If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.
If you don't know how big your data set will be getting, I would consider this the absolute minimum.
A hash is probably the next step up from there. This performs a function on the string to get a bucket number (usually treated as an array index of some sort). This is O(1) lookup but the aim is to have a hash function that only allocates one item per bucket, so that no further processing is required to get the value.
A degenerate case of "all items in the same bucket" is no different to an array or linked list.
For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.
Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.
I had to put one of these together recently for converting textual months ("January", etc) in to month numbers. You can see the process here.
I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which even covers EBCDIC as well as ASCII though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:
char *s = "X";
int val = *s - 'W';
Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.
C doesn't have any collection classes. C++ has std::map.
You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/
A 'trie' or a 'hashmap' should do. The simplest implementation is an array of struct { char *s; int i; } pairs.
Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.
Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.

Python: Set with only existence check?

I have a set of lots of big long strings that I want to do existence lookups for. I don't need the whole string ever to be saved. As far as I can tell, the set() actually stores the strings, which is eating up a lot of my memory.
Does such a data structure exist?
done = hash_only_set()
while len(queue) > 0 :
    item = queue.pop()
    if item not in done :
        process(item)
        done.add(item)
(My queue is constantly being filled by other threads so I have no way of dedupping it at the start).
It's certainly possible to keep a set of only hashes:
done = set()
while len(queue) > 0 :
    item = queue.pop()
    h = hash(item)
    if h not in done :
        process(item)
        done.add(h)
Notice that because of hash collisions, there is a chance that you consider an item done even though it isn't.
If you cannot accept this risk, you really need to save the full strings to be able to tell whether you have seen it before. Alternatively: perhaps the processing itself would be able to tell?
Yet alternatively: if you cannot accept to keep the strings in memory, keep them in a database, or create files in a directory with the same name as the string.
You can use a data structure called Bloom Filter specifically for this purpose. A Python implementation can be found here.
EDIT: Important notes:
False positives are possible in this data structure, i.e. a check for the existence of a string could return a positive result even though it was not stored.
False negatives (getting a negative result for a string that was stored) are not possible.
That said, the chances of this happening can be brought to a minimum if used properly and so I consider this data structure to be very useful.
If you use a secure hash function (like SHA-256, found in the hashlib module) to hash the strings, it's very unlikely that you would find duplicates (and if you do find some, you can probably win a prize, as with most cryptographic hash functions).
The builtin __hash__() method does not guarantee you won't have duplicates (and since it only uses 32 bits, it's very likely you'll find some).
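A minimal sketch of that approach, storing fixed-size SHA-256 digests instead of the strings themselves:
import hashlib

done = set()

def mark_done(item):
    # Store only the 32-byte digest, not the (possibly huge) string itself.
    done.add(hashlib.sha256(item.encode('utf-8')).digest())

def is_done(item):
    return hashlib.sha256(item.encode('utf-8')).digest() in done

mark_done('some very long string' * 1000)
print(is_done('some very long string' * 1000))   # True
print(is_done('another string'))                 # False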
You need to know the whole string to have 100% certainty. If you have lots of strings with similar prefixes you could save space by using a trie to store the strings. If your strings are long you could also save space by using a large hash function like SHA-1 to make the possibility of hash collisions so remote as to be irrelevant.
If you can make the process() function idempotent - i.e. having it called twice on an item is only a performance issue - then the problem becomes a lot simpler and you can use lossy data structures, such as Bloom filters.
You would have to think about how to do the lookup, since there are two methods that the set needs, __hash__ and __eq__.
The hash is a "loose part" that you can take away, but the __eq__ is not a loose part that you can save; you have to have two strings for the comparison.
If you only need negative confirmation (this item is not part of the set), you could fill a Set collection you implemented yourself with your strings, then you "finalize" the set by removing all strings, except those with collisions (those are kept around for eq tests), and you promise not to add more objects to your Set. Now you have an exclusive test available.. you can tell if an object is not in your Set. You can't be certain if "obj in Set == True" is a false positive or not.
Edit: This is basically a bloom filter that was cleverly linked, but a bloom filter might use more than one hash per element which is really clever.
Edit2: This is my 3-minute bloom filter:
class BloomFilter (object):
    """
    Let's make a bloom filter
    http://en.wikipedia.org/wiki/Bloom_filter
    __contains__ has false positives, but never false negatives
    """
    def __init__(self, hashes=(hash, )):
        self.hashes = hashes
        self.data = set()

    def __contains__(self, obj):
        return all((h(obj) in self.data) for h in self.hashes)

    def add(self, obj):
        self.data.update(h(obj) for h in self.hashes)
As has been hinted already, if the answers offered here (most of which break down in the face of hash collisions) are not acceptable you would need to use a lossless representation of the strings.
Python's zlib module provides built-in string compression capabilities and could be used to pre-process the strings before you put them in your set. Note however that the strings would need to be quite long (which you hint that they are) and have minimal entropy in order to save much memory space. Other compression options might provide better space savings and some Python based implementations can be found here
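A small sketch of that pre-processing step (whether it actually saves memory depends entirely on how long and how compressible the strings are):
import zlib

done = set()

def add(item):
    # zlib compression is lossless and deterministic, so membership still works.
    done.add(zlib.compress(item.encode('utf-8'), 9))

def contains(item):
    return zlib.compress(item.encode('utf-8'), 9) in done

add('a long, highly repetitive string ' * 100)
print(contains('a long, highly repetitive string ' * 100))   # True
print(contains('something else'))                            # False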

Pythonic way to implement a tokenizer

I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice?
I've implemented a tokenizer before in C and in Java so I'm fine with the theory, I'd just like to ensure I'm following pythonic styles and best practices.
Listing Token Types:
In Java, for example, I would have a list of fields like so:
public static final int TOKEN_INTEGER = 0
But, obviously, there's no way (I think) to declare a constant variable in Python, so I could just replace this with normal variable declarations but that doesn't strike me as a great solution since the declarations could be altered.
Returning Tokens From The Tokenizer:
Is there a better alternative to just simply returning a list of tuples e.g.
[ (TOKEN_INTEGER, 17), (TOKEN_STRING, "Sixteen")]?
Cheers,
Pete
There's an undocumented class in the re module called re.Scanner. It's very straightforward to use for a tokenizer:
import re

scanner = re.Scanner([
    (r"[0-9]+", lambda scanner, token: ("INTEGER", token)),
    (r"[a-z_]+", lambda scanner, token: ("IDENTIFIER", token)),
    (r"[,.]+", lambda scanner, token: ("PUNCTUATION", token)),
    (r"\s+", None),  # None == skip token.
])

results, remainder = scanner.scan("45 pigeons, 23 cows, 11 spiders.")
print results
will result in
[('INTEGER', '45'),
('IDENTIFIER', 'pigeons'),
('PUNCTUATION', ','),
('INTEGER', '23'),
('IDENTIFIER', 'cows'),
('PUNCTUATION', ','),
('INTEGER', '11'),
('IDENTIFIER', 'spiders'),
('PUNCTUATION', '.')]
I used re.Scanner to write a pretty nifty configuration/structured data format parser in only a couple hundred lines.
Python takes a "we're all consenting adults" approach to information hiding. It's OK to use variables as though they were constants, and trust that users of your code won't do something stupid.
In many situations, especially when parsing long input streams, you may find it more useful to implement your tokenizer as a generator function. This way you can easily iterate over all the tokens without needing lots of memory to build the list of tokens first.
For generators, see the original proposal or other online docs.
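A small sketch of the generator style, yielding (type, value) tuples lazily instead of building a list; the token types here are just examples:
import re
import io

TOKEN_RE = re.compile(r'(?P<INTEGER>[0-9]+)|(?P<IDENTIFIER>[a-z_]+)|(?P<SKIP>\s+)')

def tokenize(stream):
    """Yield (token_type, text) pairs one line at a time, without building a list."""
    for line in stream:
        for m in TOKEN_RE.finditer(line):
            if m.lastgroup != 'SKIP':
                yield (m.lastgroup, m.group())

for tok in tokenize(io.StringIO(u'45 pigeons\n23 cows\n')):
    print(tok)
# ('INTEGER', '45'), ('IDENTIFIER', 'pigeons'), ('INTEGER', '23'), ('IDENTIFIER', 'cows')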
Thanks for your help, I've started to bring these ideas together, and I've come up with the following. Is there anything terribly wrong with this implementation (particularly I'm concerned about passing a file object to the tokenizer):
class Tokenizer(object):
    def __init__(self,file):
        self.file = file

    def __get_next_character(self):
        return self.file.read(1)

    def __peek_next_character(self):
        character = self.file.read(1)
        self.file.seek(self.file.tell()-1,0)
        return character

    def __read_number(self):
        value = ""
        while self.__peek_next_character().isdigit():
            value += self.__get_next_character()
        return value

    def next_token(self):
        character = self.__peek_next_character()
        if character.isdigit():
            return self.__read_number()
"Is there a better alternative to just simply returning a list of tuples?"
Nope. It works really well.
"Is there a better alternative to just simply returning a list of tuples?"
That's the approach used by the "tokenize" module for parsing Python source code. Returning a simple list of tuples can work very well.
I have recently built a tokenizer, too, and passed through some of your issues.
Token types are declared as "constants", i.e. variables with ALL_CAPS names, at the module level. For example,
_INTEGER = 0x0007
_FLOAT = 0x0008
_VARIABLE = 0x0009
and so on. I have used an underscore in front of the name to point out that somehow those fields are "private" to the module, but I really don't know whether this is typical or advisable, or even how Pythonic it is. (Also, I'll probably ditch numbers in favour of strings, because they are much more readable during debugging.)
Tokens are returned as named tuples.
from collections import namedtuple
Token = namedtuple('Token', ['value', 'type'])
# so that e.g. somewhere in a function/method I can write...
t = Token(n, _INTEGER)
# ...and return it properly
I have used named tuples because the tokenizer's client code (e.g. the parser) seems a little clearer while using names (e.g. token.value) instead of indexes (e.g. token[0]).
Finally, I've noticed that sometimes, especially writing tests, I prefer to pass a string to the tokenizer instead of a file object. I call it a "reader", and have a specific method to open it and let the tokenizer access it through the same interface.
def open_reader(self, source):
    """
    Produces a file object from source.
    The source can be either a file object already, or a string.
    """
    if hasattr(source, 'read'):
        return source
    else:
        from io import StringIO
        return StringIO(source)
When I start something new in Python I usually look first at some modules or libraries to use. There's a 90%+ chance that something is already available.
For tokenizers and parsers this is certainly so. Have you looked at PyParsing ?
I've implemented a tokenizer for a C-like programming language. What I did was to split up the creation of tokens into two layers:
a surface scanner: This one actually reads the text and uses regular expressions to split it up into only the most primitive tokens (operators, identifiers, numbers, ...); it yields tuples of (tokenname, scannedstring, startpos, endpos).
a tokenizer: This consumes the tuples from the first layer, turning them into token objects (named tuples would do as well, I think). Its purpose is to detect some long-range dependencies in the token stream, particularly strings (with their opening and closing quotes) and comments (with their opening and closing lexemes - yes, I wanted to retain comments!), and coerce them into single tokens. The resulting stream of token objects is then returned to a consuming parser.
Both are generators. The benefits of this approach were:
Reading of the raw text is done only in the most primitive way, with simple regexps - fast and clean.
The second layer is already implemented as a primitive parser, to detect string literals and comments - re-use of parser technology.
You don't have to strain the surface scanner with complex detections.
But the real parser gets tokens on the semantic level of the language to be parsed (again strings, comments).
I feel quite happy with this layered approach.
I'd turn to the excellent Text Processing in Python by David Mertz.
This being a late answer, there is now something in the official documentation: Writing a tokenizer with the re standard library. This is content in the Python 3 documentation that isn't in the Py 2.7 docs. But it is still applicable to older Pythons.
This includes both short code, easy setup, and writing a generator as several answers here have proposed.
If the docs are not Pythonic, I don't know what is :-)
"Is there a better alternative to just simply returning a list of tuples"
I had to implement a tokenizer, but it required a more complex approach than a list of tuples, therefore I implemented a class for each token. You can then return a list of class instances, or if you want to save resources, you can return something implementing the iterator interface and generate the next token while you progress in the parsing.
