I want to use the NLTK chunker for Tamil language (which is an Indic language). However, it says that it doesn't support Unicode because it uses the 'pre' module for regular expressions.
Unresolved Issues
If we use the re module for regular expressions, Python's regular
expression engine generates "maximum recursion depth exceeded" errors
when processing very large texts, even for regular expressions that
should not require any recursion. We therefore use the pre module
instead. But note that pre does not include Unicode support, so
this module will not work with unicode strings.
Any suggestions for a workaround or another way to accomplish this?
Chunkers are language-specific, so you need to train one for Tamil anyway. Of course if you are happy with available off-the-shelf solutions (I've got no idea if there are any, e.g. if the link in the now-deleted answer is any good), you can stop reading here. If not, you can train your own but you'll need a corpus that is annotated with the chunks you want to recognize: perhaps you are after NP chunks (the usual case), but maybe it's something else.
Once you have an annotated corpus, look carefully at chapters 6 and 7 of the NLTK book, and especially section 7.3, Developing and evaluating chunkers. While Chapter 7 begins with the NLTK's regexp chunker, keep reading and you'll see how to build a "sequence classifier" that does not rely on the NLTK's regexp-based chunking engine. (Chapter 6 is essential for this, so don't skip it.)
It's not a trivial task: you need to understand the classifier approach, put the pieces together, probably convert your corpus to IOB format, and finally select features that will give you satisfactory performance. But it is pretty straightforward, and can be carried out for any language or chunking task for which you have an annotated corpus. The only open-ended part is thinking up contextual cues that you can convert into features to help the classifier decide correctly, and experimenting until you find a good mix. (On the upside, it is a much more powerful approach than pure regexp-based solutions, even for ASCII text.)
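To give a feel for the IOB step, here is a minimal sketch (assuming NLTK 3; the tiny chunk tree is made up for illustration) of turning one chunk-annotated sentence into the (word, POS, IOB) triples a sequence classifier is trained on:
import nltk
from nltk.chunk import tree2conlltags

# A tiny hand-made chunk tree standing in for one sentence of an annotated
# corpus; leaves are (word, POS) pairs and NP is the only chunk type here.
tree = nltk.Tree('S', [
    nltk.Tree('NP', [('the', 'DT'), ('cat', 'NN')]),
    ('purred', 'VBD'),
])

# tree2conlltags flattens the tree into IOB-tagged triples.
print(tree2conlltags(tree))
# [('the', 'DT', 'B-NP'), ('cat', 'NN', 'I-NP'), ('purred', 'VBD', 'O')]
Once your corpus looks like this, the classifier from section 7.3 is trained to predict the third column from features of the first two.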
You can use LTRC's Shallow Parser for the Tamil language.
You can check out the online demo here.
Related
I wrote a lexical analyzer for C++ code in Python, but the problem is that when I use input.split(" ") it won't recognize code like x=2 or function() as three different tokens unless I manually add a space between them, like x = 2.
It also fails to recognize the tokens at the beginning of each line.
(If I add spaces between every two tokens and also at the beginning of each line, my code works correctly.)
I tried splitting the code first by lines and then by spaces, but it got complicated and I still wasn't able to solve the first problem.
I also thought about splitting it by operators, but I couldn't actually implement it. Plus, I need the operators to be recognized as tokens as well, so this might not be a good idea.
I would appreciate it if anyone could offer a solution or suggestion. Thank you.
f = open("code.txt")
input = f.read()          # read the whole file into one string
input = input.split(" ")  # split on single spaces only
f = open("code.txt")
input = f.read()
input1 = input.split("\n")   # first split into lines
for var in input1:
    var = var.split(" ")     # then split each line on spaces
Obviously, if you try to split an expression like x=2 the same way as x = 2, splitting on spaces alone isn't going to work.
What you are looking for is a solution that works with both, right?
A basic solution is to use an and operator and combine the conditions you need for parsing. Note that this solution isn't scalable and doesn't qualify as good practice, but it can help you work your way towards better, harder solutions.
if input.split(' ') and input.split('='):
An intermediate solution would be to use regex.
Regex isn't an easy topic, but you can check out the online documentation, and there are excellent online tools for testing your regexes.
Regex 101
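For instance, a rough sketch of a regex-based tokenizer for input like yours (the pattern is only illustrative: it covers identifiers, integers and a handful of operators, nowhere near full C++):
import re

# Longer operators must come before their one-character prefixes so that
# "==" is not split into "=", "=".
TOKEN_PATTERN = r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|[-+*/=<>(){};,]"

def tokenize(line):
    # findall reports every non-overlapping match and silently skips
    # characters (such as whitespace) that match no alternative.
    return re.findall(TOKEN_PATTERN, line)

print(tokenize("x=2"))         # ['x', '=', '2']
print(tokenize("function()"))  # ['function', '(', ')']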
The last option would be to convert your input data into an AST, which stands for abstract syntax tree. This is the technique employed by C++ compilers such as Clang.
This last one is a genuinely hard topic, so using it just to get a basic lexer will probably be very time-consuming, but it might fit your needs.
The usual approach is to scan the incoming text from left to right. At each character position, the lexical analyser selects the longest string which fits some pattern for a "lexeme", which is either a token or ignored input (whitespace and comments, for example). Then the scan continues at the next character.
Lexical patterns are often described using regular expressions, but the standard regular expression module re is not as much help as it could be for this procedure, because it does not have the facility of checking multiple regular expressions in parallel. (And neither does the possible future replacement, the regex module.) Or, more precisely, the library can check multiple expressions in parallel (using alternation syntax, (...|...|...)), but it lacks an interface which can report which of the alternatives was matched. [Note 1]. So it would be necessary to try every possible pattern one at a time and select whichever one turns out to have the longest match.
Note that the matches are always anchored at the current input point; the lexical analyser does not search for a matching pattern. Every input character becomes part of some lexeme, even if that lexeme is ignored, and lexemes do not overlap.
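As a rough illustration of that "try each pattern at the current position and keep the longest match" loop (the token set is a made-up fragment, not a serious C++ lexer):
import re

# (token name, pattern) pairs; WS is "ignored input".
TOKEN_SPECS = [
    ("WS",     re.compile(r"\s+")),
    ("NUMBER", re.compile(r"\d+")),
    ("NAME",   re.compile(r"[A-Za-z_]\w*")),
    ("OP",     re.compile(r"==|!=|<=|>=|[-+*/=<>(){};,]")),
]

def lex(text):
    pos = 0
    while pos < len(text):
        best_name, best_match = None, None
        for name, pattern in TOKEN_SPECS:
            m = pattern.match(text, pos)   # anchored at pos, no searching
            if m and (best_match is None or m.end() > best_match.end()):
                best_name, best_match = name, m
        if best_match is None:
            raise SyntaxError("unexpected character %r at %d" % (text[pos], pos))
        if best_name != "WS":              # whitespace is a lexeme, but ignored
            yield best_name, best_match.group()
        pos = best_match.end()

print(list(lex("x=2; y == x")))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '2'), ('OP', ';'),
#  ('NAME', 'y'), ('OP', '=='), ('NAME', 'x')]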
You can write such an analyser by hand for a simple language, but C++ is hardly a simple language. Hand-built lexical analysers most certainly exist, but all the ones I've seen are thousands of lines of not very readable code. So it's usually easier to build an analyser automatically using software designed for that purpose. These have been around for a long time -- Lex was written almost 50 years ago, for example -- and if you are planning on writing more than one lexical analyser, you would be well advised to investigate some of the available tools.
Notes
The PCRE2 and Oniguruma regex libraries provide a "callout" feature which I believe could be used for this purpose. I haven't actually seen it used in lexical analysis, but it's a fairly recent addition, particularly for Oniguruma, and as far as I can see, the Python bindings for those two libraries do not wrap the callout feature. (Although, as usual with Python bindings to C libraries, documentation is almost non-existent, so I can't say for certain.)
It is possible to write a Regex which in some cases needs exponential running time. Such an example is (aa|aa)*. If the input consists of an odd number of a's, it needs exponential running time.
It is easy to test this. If the input contains only a's and has length 51, the Regex needs some seconds to compute (on my machine). Instead, if the input length is 52, its computing time is not noticeable (I tested this with Java's built-in Regex parser).
I have written a Regex parser to find the reason for this behavior, but I didn't find it. My parser can build an AST or an NFA based on a Regex. After that it can translate the NFA to a DFA. To do this it uses the powerset construction algorithm.
When I parse the Regex mentioned above, the parser creates an NFA with 7 states - after conversion there are only 3 states left in the DFA. The DFA represents the more sensible Regex (aa)*, which can be matched very fast.
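For reference, a small timing sketch in Python (whose re module is also a backtracking engine) that reproduces the odd/even contrast; the lengths are kept modest on purpose, since the odd cases roughly double in cost with every extra pair of a's:
import re
import time

pattern = re.compile(r"(aa|aa)*$")

for n in (30, 31, 36, 37, 42, 43):
    start = time.perf_counter()
    pattern.match("a" * n)   # even n matches quickly; odd n backtracks
    print("n=%2d  %.4f s" % (n, time.perf_counter() - start))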
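For reference, a bare-bones powerset construction over an epsilon-free NFA (the three-state NFA at the bottom is a hand-simplified stand-in for the parser's 7-state one, just to show the subsets collapsing):
from collections import deque

def nfa_to_dfa(start, accepting, delta, alphabet):
    # Powerset construction: every DFA state is a set of NFA states.
    # delta maps (state, symbol) to a set of successor states.
    start_set = frozenset([start])
    dfa_delta, dfa_accepting = {}, set()
    queue, seen = deque([start_set]), {start_set}
    while queue:
        subset = queue.popleft()
        if subset & accepting:
            dfa_accepting.add(subset)
        for symbol in alphabet:
            target = frozenset(t for q in subset
                               for t in delta.get((q, symbol), ()))
            dfa_delta[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start_set, dfa_accepting, dfa_delta

# Hand-simplified NFA for (aa|aa)*: the two identical branches become
# parallel transitions out of state 0, and the subsets merge them again.
delta = {(0, "a"): {1, 2}, (1, "a"): {0}, (2, "a"): {0}}
print(nfa_to_dfa(0, {0}, delta, {"a"}))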
Thus, I don't understand why there are parsers which can be so slow. What is the reason for this? Do they not translate the NFA to a DFA? If yes, why not? And what's the technical reasons why they compute so slow?
Russ Cox has a very detailed article about why this is and the history of regexes (part 2, part 3).
Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.
Largely, it comes down to proliferation of non-regular features in "regular" expressions such as backreferences, and the (continued) ignorance of most programmers that there are better alternatives for regexes that do not contain such features (which is many of them).
While writing the text editor sam in the early 1980s, Rob Pike wrote a new regular expression implementation, which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike's implementation incorporated submatch tracking into an efficient NFA simulation but, like the rest of the Eighth Edition source, was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch, but using backtracking, and released his implementation into the public domain. It became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on. (In his defense, Spencer knew the routines could be slow, and he didn't know that a more efficient algorithm existed. He even warned in the documentation, “Many users have found the speed perfectly adequate, although replacing the insides of egrep with this code would be a mistake.”) Pike's regular expression implementation, extended to support Unicode, was made freely available with sam in late 1992, but the particularly efficient regular expression search algorithm went unnoticed.
Regular expressions conforming to this formal definition are computable in linear time, because they have corresponding finite automata. They are built only from parentheses, alternation | (sometimes called sum), the Kleene star * and concatenation.
Extending regular expressions by adding, for example, backreferences can even make the matching problem NP-complete.
Here you can find an example of a regular expression recognizing non-prime numbers.
I guess that such an extended implementation can have non-linear matching time even in simple cases.
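The classic illustration is the unary "non-prime" regex: with a backreference the pattern can guess a divisor and check it, something no finite automaton can do. A quick sketch in Python:
import re

# Matches a string of n ones iff n is 0, 1 or composite, i.e. iff n is NOT prime.
# (11+?) guesses a divisor of at least 2, and \1+ checks that repeating it
# covers the rest of the string exactly.
NON_PRIME = re.compile(r"^1?$|^(11+?)\1+$")

for n in range(2, 20):
    if not NON_PRIME.match("1" * n):
        print(n, "is prime")   # prints 2, 3, 5, 7, 11, 13, 17, 19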
I made a quick experiment in Perl and your regular expression computes equally fast for both odd and even numbers of 'a's.
I was wondering if anyone knew of a good Python library for evaluating text-based mathematical expressions. So for example,
>>> evaluate("Three plus nine")
12
>>> evaluate("Eight + two")
10
I've seen similar examples that people have done for numeric values and operators in a string. One method used eval to compute the literal value of the expression, and another method used regex to parse the text.
If there isn't an existing library that handles this well, I will probably end up using a combination of the regex and eval techniques. I just want to confirm that there isn't something like this already out there.
You could try pyparsing, which does general recursive descent parsing. In fact, here is something quite close to your second example.
About your other suggestions.
See here about the security issues of eval (ironically, using it for a calculator).
Fundamentally, regular languages are weaker than the languages recognized by pushdown automata (the context-free languages). You shouldn't try to fight a general parsing problem with regexes.
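If you do end up rolling your own, even a dictionary of number words plus the operator module covers the two examples in the question without touching eval; a minimal sketch (the evaluate function and its tiny vocabulary are invented for illustration and only handle a single "X op Y" phrase):
import operator

NUMBERS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
           "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
OPERATORS = {"plus": operator.add, "+": operator.add,
             "minus": operator.sub, "-": operator.sub,
             "times": operator.mul, "*": operator.mul}

def to_number(word):
    word = word.lower()
    return NUMBERS[word] if word in NUMBERS else int(word)

def evaluate(text):
    # Only handles a single binary expression such as "Three plus nine".
    left, op, right = text.split()
    return OPERATORS[op.lower()](to_number(left), to_number(right))

print(evaluate("Three plus nine"))  # 12
print(evaluate("Eight + two"))      # 10
Anything beyond a single binary expression is exactly where a real grammar such as pyparsing starts to pay off.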
Is there any practical difference in power between a 'regular expression' as exemplified by NLTK's docs and a CFG from the same? There definitely should be, since there are context-free languages which are not regular, but I can't find a concrete example where the CFG approach outshines a regular expression.
http://nltk.org/book/ch07.html
From the documentation of RegexpParser:
The patterns of a clause are executed in order. An earlier
pattern may introduce a chunk boundary that prevents a later
pattern from executing. Sometimes an individual pattern will
match on multiple, overlapping extents of the input. As with
regular expression substitution more generally, the chunker will
identify the first match possible, then continue looking for matches
after this one has ended.
The clauses of a grammar are also executed in order. A cascaded
chunk parser is one having more than one clause. The maximum depth
of a parse tree created by this chunk parser is the same as the
number of clauses in the grammar.
That is, each clause/pattern is executed once. Thus you'll run into trouble as soon as you need the output of a later clause to be matched by an earlier one.
A practical example is the way something that could be a complete sentence on its own can be used as a clause in a larger sentence:
The cat purred.
He heard that the cat purred.
She saw that he heard that the cat purred.
As we can read from the documentation above, when you construct a RegexpParser you're setting an arbitrary limit for the "depth" of this sort of sentence. There is no "recursion limit" for context-free grammars.
The documentation mentions that you can use looping to mitigate this somewhat -- if you run through a suitable grammar two or three or four times, you can get a deeper parse. You can add external logic to loop your grammar many times, or until nothing more can be parsed.
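Concretely, something like the cascaded grammar from the NLTK book's section on recursion in linguistic structure (reproduced here from memory, so treat the details as a sketch); loop=2 runs the whole cascade twice so that a CLAUSE found on the first pass can feed the VP pattern on the second:
import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # chunk NP, VP
"""
cp = nltk.RegexpParser(grammar, loop=2)

sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
            ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cp.parse(sentence))
Even so, the number of passes is fixed up front, whereas a CFG imposes no such depth limit.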
However, as the documentation also notes, the basic approach of this parser is still "greedy". It proceeds like this for a fixed or variable number of steps:
Do as much chunking as you can in one step.
Use the output of the last step as the input of the next step, and repeat.
This is naïve because if an early step makes a mistake, this will ruin the whole parse.
Think of a "garden path sentence":
The horse raced past the barn fell.
And a similar string but an entirely different sentence:
The horse raced past the barn.
It will likely be hard to construct a RegexpParser that will parse both of these sentences, because the approach relies on the initial chunking being correct. Correct initial chunking for one will probably be incorrect initial chunking for the other, yet you can't know "which sentence you're in" until you're at a late level in the parsing logic.
For instance, if "the barn fell" is chunked together early on, the parse will fail.
You can add external logic to backtrack when you end up with a "poor" parse, to see if you can find a better one. However, I think you'll find that at that point, more of the important parts of the parsing algorithm are in your external logic, instead of in RegexpParser.
From Perl's documentation:
study takes extra time to study SCALAR ($_ if unspecified) in anticipation of doing
many pattern matches on the string before it is next modified. This may or may not save
time, depending on the nature and number of patterns you are searching and the distribution
of character frequencies in the string to be searched;
I'm trying to speed up some regular expression-driven parsing that I'm doing in Python, and I remembered this trick from Perl. I realize I'll have to benchmark to determine if there is a speedup, but I can't find an equivalent method in Python.
Perl’s study doesn’t really do much anymore. The regex compiler has gotten a whole, whole lot smarter than it was when study was created.
For example, it compiles alternatives into a trie structure with Aho–Corasick prediction.
Run with perl -Mre=debug to see the sorts of cleverness the regex compiler and execution engine apply.
As far as I know there's nothing like this built into Python. But according to the perldoc:
The way study works is this: a linked list of every character in the
string to be searched is made, so we know, for example, where all the
'k' characters are. From each search string, the rarest character is
selected, based on some static frequency tables constructed from some
C programs and English text. Only those places that contain this
"rarest" character are examined.
This doesn't sound very sophisticated, and you could probably hack together something equivalent yourself.
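If you wanted to experiment, here's a rough sketch of the same idea in Python (the frequency table is a made-up stand-in for the static tables the perldoc mentions, and find only handles literal needles, not regexes):
from collections import defaultdict

# Rough rarity ordering of English letters: lower value = rarer.
FREQ = {c: i for i, c in enumerate("zqxjkvbpygfwmucldrhsnioate")}

def study(haystack):
    # Like Perl's study: index every position of every character.
    index = defaultdict(list)
    for pos, ch in enumerate(haystack):
        index[ch].append(pos)
    return index

def find(index, haystack, needle):
    # Only examine the places where the needle's rarest character occurs.
    rarest = min(needle, key=lambda c: FREQ.get(c, len(FREQ)))
    offset = needle.index(rarest)
    for pos in index.get(rarest, []):
        start = pos - offset
        if start >= 0 and haystack[start:start + len(needle)] == needle:
            return start
    return -1

text = "the quick brown fox jumps over the lazy dog"
print(find(study(text), text, "quick"))   # 4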
esmre is kind of vaguely similar. And as #Frg noted, you'll want to use re.compile if you're reusing a single regex (to avoid re-parsing the regex itself over and over).
Or you could use suffix trees (here's one implementation, or here's a C extension with Unicode support) or suffix arrays (implementation).