I am trying to make a kind of data miner with Python. The data I am examining is a dictionary of the Greek language. The dictionary was originally in PDF format, and I turned it into a roughly corresponding HTML format to parse it more easily. I have done some further formatting on it, since the data structure was heavily distorted.
My current task is to find and separately store the individual words, along with their descriptions. The first thought that came to mind was to identify the words first, apart from their descriptions. The header of each word's entry has a very specific syntax, and I use that to create a corresponding regular expression to match each and every one of them.
There is one problem, though. Despite the formatting I have done to the HTML so far, there are still many points where a logical run of data is interrupted by the sequence <br/> followed by a newline, at unpredictable positions. Is there any way to direct my regular expression to "ignore" that sequence, that is, to treat it as non-existent when met, and therefore include those matches which are interrupted by it?
That is, without putting a (<br/>\n)? in every part of my RE to cover every possible case.
The regular expression I use is the following:
(ο|η|το)?( )?<b>([α-ωάέήίόύώϊϋΐΰ])*</b>(, ((ο|η|το)? <b>([α-ωάέήίόύώϊϋΐΰ])*</b>))*( \(.*\))? ([Α-Ω])*\.( \(.*\))?<b>:</b>
and does a fine job with the matching, when the data is not interrupted by the sequence given above.
The problem, in case it is not clear, is that the interrupting sequence can occur anywhere within the match. So, as I explained earlier, I am looking for a way to ignore the sequence when deciding whether to return a match, other than covering every single spot where it might occur.
What you're asking for is a different regular expression.
The new regular expression would be the old one, with (<br\s*?/>\n?)? or the like after every non-quantifier character.
You could write something to transmute a regular expression into the form you're looking for. It would take in your existing regex and produce a br-tolerant regex. No construct in the regular expression grammar exists to do this for you automatically.
I think the easier thing to do is to transform the source document so it no longer contains the sequences you wish to ignore. This should be an easy text substitution.
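For instance, a minimal sketch of that substitution, assuming the stray sequence is <br/> (possibly with spaces before the slash) followed by an optional newline; html_source is a placeholder for your document text:

import re

# html_source stands in for the real document text.
html_source = "<b>exam<br/>\nple</b> entry"

# Remove the interrupting sequence up front, so the word-header
# regex can then be applied to clean text.
cleaned = re.sub(r"<br\s*/>\n?", "", html_source)
print(cleaned)  # <b>example</b> entry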
If it weren't for your explicit use of the <b> tags for meaning, an alternative would be to just take the plain-text document content instead of the HTML content.
Related
I have been learning Python for some weeks. Presently I'm having some problems/questions with Python's re.findall().
In some books or videos they use re.search(), and seldom do they use match(). In the Django documentation I read that search finds the first match anywhere in a string, and re.match finds a match only at the beginning of a string.
But in all these cases re.findall() would work just as well. So why shouldn't I simply use re.findall() all the time?
As I want to get better at Python, I want to understand this, which is why I am asking.
Best regards Jonathan
The most important reason is performance: findall needs to find all occurrences, so it searches the whole string. search scans only until it finds the first match, so if the pattern occurs early it stops early. match only attempts the pattern at the very beginning of the string, so it usually doesn't need to examine the whole string (except in some edge cases).
So in the best case, findall will be slower than match or search.
Additionally, findall stores all matches, so it can easily take much more memory than search or match, which only store the first match (or nothing at all).
So findall is also more expensive in terms of memory.
Last but not least, match and search return SRE_Match objects, which store not only the matched substring but also its position (and groups, if you use patterns with capture groups). Thanks to #kindall, who posted this in the comments.
So, while you could use findall instead of match or search, it can be slower, use more memory, and keep less information about each match. I wouldn't use it as a replacement for search or match unless you actually need to find all occurrences.
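A quick illustration of the differences; the pattern and text here are made up for the example:

import re

text = "cat hat cot"
pattern = re.compile(r"c.t")

print(pattern.match(text).group())   # 'cat'  -- tries position 0 only
print(pattern.search(text).group())  # 'cat'  -- scans until the first match
print(pattern.findall(text))         # ['cat', 'cot'] -- scans the whole string

# search and match return a match object that also knows its position:
m = pattern.search(text)
print(m.start())  # 0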
I am trying to extract special chunks of POS tags like this. Many chunks from different patterns are working well, and similar sentences can be found using them. But the problem arises when I can see the exact sequence of tags I have defined right there in the tagged words, yet the machine cannot find them as a chunk under the name I have defined. Example:
{<VB><RB.?><VB><NN.?>+<IN>*<JJ.?>*<NN.?>*}
This would easily find a sentence like:
Do not take money from internal relations
But when I have another pattern:
{<IN><DT>*<NN.?>+<VBZ><RB.?>*<JJ.?><CC>*<PRP$><NN.?>+<VBZ><JJ.?><TO><VB><CC>*<VBG><PRP><MD><VB>}
for the example:
if the present is not easy, or its size is difficult to quantify, but declining it would satisfy
It is not possible to detect it, and it shows up only as an S, although I believe the pattern matches exactly. Can this be because the clause I am looking for is sometimes at the beginning, sometimes in the middle, and sometimes at the end of the sentence? Can it be because I use PunktSentenceTokenizer?
Any help would be appreciated.
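For reference, a minimal sketch of how such a chunk grammar is applied, presumably via nltk.RegexpParser; the chunk name "CHUNK" and the toy sentence are placeholders, not taken from the question, and .label() is the NLTK 3 spelling (older versions used .node):

import nltk

grammar = r"CHUNK: {<VB><RB.?><VB><NN.?>+<IN>*<JJ.?>*<NN.?>*}"
parser = nltk.RegexpParser(grammar)

# Toy tagged sentence matching the first pattern from the question.
tagged = [("Do", "VB"), ("not", "RB"), ("take", "VB"), ("money", "NN"),
          ("from", "IN"), ("internal", "JJ"), ("relations", "NNS")]
tree = parser.parse(tagged)

# Print only the named chunks, to check whether the grammar fired.
for subtree in tree.subtrees():
    if subtree.label() == "CHUNK":
        print(subtree)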
Using Python 2.6.6.
I was hoping that the re module provided some method of searching that mimicked the way str.find() works, allowing you to specify a start index, but apparently not...
search() lets me find the first match...
findall() will return all (non-overlapping!) matches of a single pattern
finditer() is like findall(), but via an iterator (more efficient)
Here is the situation... I'm data mining in huge blocks of data. For parts of the parsing, regex works great. But once I find certain matches, I need to switch to a different pattern, or even use more specialized parsing to find where to start searching next. If re.search allowed me to specify a starting index, it would be perfect. But in absence of that, I'm looking at:
Using finditer(), but skipping forward until I reach an index that is past where I want to resume using re. Potential problems:
If the embedded binary data happens to contain a match that overlaps a legitimate match just after the binary chunk...
Since I'm not searching for a single pattern, I'd have to juggle multiple iterators, which also has the possibility of a false match hiding the real one.
Slicing, i.e., creating a copy of the remainder of the data each time I want to search again.
This would be robust, but would force a lot of "needless" copying on data that could be many megabytes.
I'd prefer to keep it so that all match locations were indexes into the single original string object, since I may hang onto them for a while and want to compare them. Finding subsequent matches within separate sliced-off copies is a bookkeeping hassle.
It just occurred to me that I may be able to use a "rotating buffer" sort of approach, but I haven't thought it through completely. It might introduce a lot of complexity to the code.
Am I missing any obvious alternatives? Not sure if there would be a way to wrap a huge string with a class that would serve slices... Or a slicing sort of iterator or "string cursor" idiom?
Use a two-pass approach. The first pass uses the first regex to find the "interesting bits" and outputs those offsets into a separate file. You didn't say if you can tell where the "end" of each interesting segment is, but you'd include that too if available. The second pass uses the offsets to load sections of the file as independent strings and then applies whatever secondary regex you like on each smaller string.
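For what it's worth, compiled pattern objects do accept a start index (e.g. FIRST_RE.search(data, pos)), and their finditer() takes pos and endpos too; the sketch below leans on that, so no slices are copied and every reported position is an index into the original string. FIRST_RE and SECOND_RE are hypothetical placeholders, and each segment is assumed to run from one first-pass match to the next (the real end-of-segment rule may differ):

import re

FIRST_RE = re.compile(r"INTERESTING")  # placeholder: marks segment starts
SECOND_RE = re.compile(r"DETAIL")      # placeholder: secondary pattern

def two_pass(data):
    # Pass 1: record where each interesting segment starts.
    offsets = [m.start() for m in FIRST_RE.finditer(data)]
    offsets.append(len(data))  # sentinel so the last segment has an end

    # Pass 2: run the secondary regex inside each segment. pos/endpos
    # restrict the search without slicing, so no copies are made and
    # all reported positions are indexes into the original string.
    results = []
    for start, end in zip(offsets, offsets[1:]):
        for m in SECOND_RE.finditer(data, start, end):
            results.append((m.start(), m.group()))
    return results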
I'm not certain that regex is the best approach for this, but it seems to be fairly well suited. Essentially I'm currently parsing some PDFs using pdfminer, and the drawback is that these PDFs are exported PowerPoint slides, which means that all animations show up as fairly long repeated copies of strings. Ideally I would like just one copy of each of these strings instead of a copy for each stage of an animation. The regex pattern I'm currently using is this:
re.sub(r"([\w^\w]{10,})\1{1,}", "\1", string)
For some reason, though, this doesn't seem to change the input string. I feel like Python isn't recognizing the capture group, but I'm not sure how to remedy that issue. Any thoughts appreciated.
Examples:
I would like this
text to be
reduced
I would like this
text to be
reduced
output:
I would like this
text to be
reduced
Update:
To get this to pass the pumping lemma I had to specifically make the assertion that all duplicates were adjacent. This was implied before, but I am now making it explicit to ensure that a solution is possible.
Regexps are not the right tool for that task. Classical regexps are based on the theory of regular languages, and they cannot detect that a string contains duplicates, let alone remove those duplicates. You may find a course on automata and regexps interesting to read on the topic.
I think Josay's suggestion can be efficient and smart, but I believe I have a simpler and more Pythonic solution, though it has its limits. You can split your string into a list of lines and pass it through a set():
>>> s = """I would like this
... text to be
...
... reduced
... I would like this
... text to be
...
... reduced"""
>>> print "\n".join(set(s.splitlines()))
I would like this
text to be
reduced
>>>
The only thing with that solution is that you will lose the original order of the lines (and the example above is a pretty good counterexample for that). Also, if you have the same line in two different contexts, you will end up with only one copy of it.
To fix the first problem, you may then have to iterate over your original string a second time to put that set back in order, or simply use an ordered set (see the sketch below).
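A small sketch of the order-preserving variant; it relies on dict keys keeping insertion order (guaranteed in Python 3.7+; on older versions collections.OrderedDict does the same job):

def dedupe_lines(text):
    seen = {}
    for line in text.splitlines():
        seen.setdefault(line, None)  # keeps only the first occurrence, in order
    return "\n".join(seen)

print(dedupe_lines(s))  # with s as in the session above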
If you have some symbol separating each slide, it would help you merge only the duplicates, fixing the second problem of that solution.
Otherwise a more sophisticated algorithm would be needed, so you can take into account proximity and context. For that a suffix tree could be a good idea, and there are python libraries for that (cf that SO answer).
edit:
Using your approach, I could make it work by adding multiline support and by allowing spaces and newlines in the text-matching part:
>>> re.match(r"([\w \n]+)\n\1", string, re.MULTILINE).groups()
('I would like this\ntext to be\n\nreduced',)
Though, AFAICT, the \1 backreference is not regular expression syntax in the strict, formal sense when used in the matching part, but an extension. But it's getting late here, and I may well be totally wrong. Maybe I should reread those courses? :-)
I guess that the regexp engine's pushdown automaton is able to push matches, because here it is only one long multiline string that it can pop to match against. Though I'd expect that to have side effects...
Some time ago I started writing a BNF-based grammar for the cables which WikiLeaks released. However, I have now realized that my approach may not be the best, and I'm looking for some improvement.
A cable consists of three parts. The head has some RFC 2822-style format, which usually parses correctly. The text part has a more informal specification. For instance, there is a REF line. This should start with REF:, but I found different versions. The following regex catches most cases: ^\s*[Rr][Ee][Ff][Ss: ]. So there are spaces in front, different cases, and so on. The text part is mostly plain text with some specially formatted headings.
We want to recognize each field (date, REF, etc.) and put it into a database. We chose Python's SimpleParse. At the moment the parser stops at each field it doesn't recognize. We are now looking for a more fault-tolerant solution. All fields come in some kind of order. When the parser doesn't recognize a field, it should add a 'not recognized' blob to the current field and go on. (Or maybe you have some better approach here.)
What kind of parser or other kind of solution would you suggest? Is something better around?
Cablemap seems to do what you're searching for: http://pypi.python.org/pypi/cablemap.core/
I haven't looked at the cables, but let's take a similar problem and consider the options. Say you wanted to write a parser for RFCs: there's an RFC for the formatting of RFCs, but not all RFCs follow it.
If you write a strict parser, you'll run into the situation you've already run into: the outliers will halt your progress. In this case you've got two options:
Split them into two groups: the ones that are strictly formatted and the ones that aren't. Write your strict parser so that it gets the low-hanging fruit, and figure out, based on the number of outliers, what the best options are (hand processing, an outlier parser, etc.).
If the two groups are equally sized, or there are more outliers than standard formats, write a flexible parser. In this case regular expressions are going to be more beneficial to you, as you can process an entire file looking for a series of flexible regexes; if one of the regexes fails, you can easily generate the outlier list. And since you run the search against a series of regexes, you can build a matrix of passes/fails for each regex (a sketch follows below).
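Something like this sketch, say, with made-up field names and patterns standing in for the real cable fields:

import re

# Hypothetical per-field patterns; the real ones would mirror the cable format.
FIELD_PATTERNS = {
    "ref":  re.compile(r"^\s*[Rr][Ee][Ff][Ss: ]", re.MULTILINE),
    "date": re.compile(r"^\s*[Dd][Aa][Tt][Ee]:", re.MULTILINE),
}

def match_matrix(documents):
    # One row per document, recording which field patterns matched.
    return [{name: bool(rx.search(doc)) for name, rx in FIELD_PATTERNS.items()}
            for doc in documents]

def outliers(documents):
    # Documents where at least one field pattern failed are the outlier list.
    return [doc for doc, row in zip(documents, match_matrix(documents))
            if not all(row.values())]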
For 'fuzzy' data where some documents follow the format and some do not, I much prefer using the regex approach. That's just me, though. (Yes, it is slower, but having to engineer the relationship between each match segment so that you have a single query (or parser) that fits every corner case is a nightmare when dealing with human-generated input.)