How to print only one occurrence of a matching pattern using pcregrep - python

Is there any option in pcregrep that allows me to print only one occurrence of the matched string pattern? I came to know about the option --match-limit, but pcregrep is not recognizing this option. Is there any specific version that supports it?
I assume that --match-limit=1 prints only one occurrence of the matched pattern.
You can also let me know about other possible ways. I am executing the pcregrep command from a Python script via Python's commands module.

Before we look into --match-limit, let's review two options that almost do what you want to do.
Option 1. When you only want to know if you can find a match in a file, but you don't care what the match is, you can use the -l option like so:
pcregrep -l '\d\d\d' test.txt
where \d\d\d is the pattern and test.txt contains the strings.
Option 2. To count the number of matches, use
pcregrep -c '\d\d\d' test.txt
This may be the closest we can get to what you want to do.
What is --match-limit?
--match-limit=1 does work, but it doesn't do what you want it to do.
From the documentation:
The --match-limit option provides a means of limiting resource usage
when processing patterns that are not going to match, but which have a
very large number of possibilities in their search trees. The classic
example is a pattern that uses nested unlimited repeats. Internally,
PCRE uses a function called match() which it calls repeatedly
(sometimes recursively). The limit set by --match-limit is imposed on
the number of times this function is called during a match, which has
the effect of limiting the amount of backtracking that can take place.
So --match-limit is about limiting resource usage (backtracking), not about the number of matches.
Let's try this out:
If you make a file called test.txt and add some lines with three digits, like so:
111
123
456
Then running pcregrep --match-limit=1 '\d\d\d' test.txt will match all these lines.
But if you run pcregrep --match-limit=1 '\d{3}' test.txt you will get an error that the resource limit was exceeded.
Looking at the full documentation, I don't see any option to limit the number of matches. Of course you could design your regex to do so.
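Since you are already driving pcregrep from Python, another workaround is to skip pcregrep for this particular check and let Python's re module stop at the first match (a minimal sketch; the file name and pattern are just the ones from the example above):
import re

pattern = re.compile(r"\d\d\d")

with open("test.txt") as fh:
    for line in fh:
        match = pattern.search(line)
        if match:
            print(match.group())  # print the first occurrence only
            break                 # stop after the first match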
For more info
You probably know this, but for the short documentation type pcregrep --help
The full documentation can be downloaded in the pcre package from pcre.org
For usage examples, see grep in PCRE


Escape a period properly in snakemake expand function

I am writing a snakemake pipeline where I download various files whose filenames contain many periods, and I am having a devil of a time getting it to properly understand the file names. I have essentially two rules, a download rule and a target rule. Here they are simplified below.
rule download_files_from_s3:
    input:
        some input
    params:
        some params
    output:
        expand("destinationDir/{{sampleName}}\.{{suffix}}")
    shell:
        "aws s3 cp input destinationDir"
rule targets:
    input:
        expand("destinationDir/{sampleName}\.{suffix}", sampleName=sampleNames)
In this formulation snakemake compiles successfully, and properly downloads the files from S3 to where I want them. However, it is then unable to find them and says "waiting at most 5 seconds for missing files". When I run snakemake in dry-run mode, I can see that it expects files of the form "destinationDir/sampleName\.suffix", when in reality they exist without the backslash: "destinationDir/sampleName.suffix". My first thought was to ditch the backslash, changing my rules to the form:
expand("destinationDir/{sampleName}.{suffix}", sampleName=sampleNames)
This however creates an overly greedy regular expression. My value for suffix should be ".params.txt". When I run the version without the backslash, snakemake evaluates the wildcard sampleName as "sampleName.params" and the wildcard suffix as "txt". How ought I to best go about this, either by forcing the regular expression matching in expand to behave or by having snakemake properly interpret the '\' character? My efforts so far haven't been successful.
I don't know snakemake, but their docs mention that you can avoid ambiguity by providing a regex for your wildcards:
{wildcard,regex}
Eg:
{sampleName,[^.]+}
To quote the docs:
[...]
Multiple wildcards in one filename can cause ambiguity. Consider the pattern {dataset}.{group}.txt and assume that a file 101.B.normal.txt is available. It is not clear whether dataset=101.B and group=normal or dataset=101 and group=B.normal in this case.
Hence wildcards can be constrained to given regular expressions. Here we could restrict the wildcard dataset to consist of digits only using \d+ as the corresponding regular expression. With Snakemake 3.8.0, there are three ways to constrain wildcards. First, a wildcard can be constrained within the file pattern, by appending a regular expression separated by a comma:
output: "{dataset,\d+}.{group}.txt"
[...]
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
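To see why the constraint helps, the ambiguity can be reproduced with plain Python regular expressions, which is roughly how snakemake turns a file pattern into a matcher (a small illustrative sketch; the group names just mirror the wildcards in the question):
import re

# Unconstrained wildcards behave like greedy ".+" groups, so the first
# group swallows as much as it can, including the ".params" part.
greedy = re.fullmatch(r"(?P<sampleName>.+)\.(?P<suffix>.+)", "sample1.params.txt")
print(greedy.group("sampleName"), "|", greedy.group("suffix"))
# -> sample1.params | txt

# Constraining sampleName to anything but "." removes the ambiguity.
constrained = re.fullmatch(r"(?P<sampleName>[^.]+)\.(?P<suffix>.+)", "sample1.params.txt")
print(constrained.group("sampleName"), "|", constrained.group("suffix"))
# -> sample1 | params.txt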
Also note that {{wildcard}} will "mask" the wildcard, as mentioned in the expand section of the docs:
You can also mask a wildcard expression in expand such that it will be kept, e.g.
expand("{{dataset}}/a.{ext}", ext=FORMATS)
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function
I'm not sure how I should understand "mask", but reading the aside note at the top leads me to think it's a way to escape the special meaning of { and use the string literal {wildcard}.
Note that any placeholders in the shell command (like {input}) are always evaluated and replaced when the corresponding job is executed, even if they are occurring inside a comment. To avoid evaluation and replacement, you have to mask the braces by doubling them, i.e. {{input}}.
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
The short answer is you don't need to escape the period, the string in expand isn't a regex.
If your value for the suffix always is "params.txt", a simple solution would be to specify that:
expand("destinationDir/{sampleName}.params.txt", sampleName=sampleNames)
If the suffix is variable but you want to constrain it somehow, then according to the latest documentation, single brackets around the suffix combined with allow_missing should preserve suffix as a wildcard, and you should be able to (as mentioned in this answer) use wildcard constraints to specify the potential suffixes. Assuming that your suffix pattern is ".SOMETHING.txt":
rule targets:
    wildcard_constraints:
        suffix = "\.[a-zA-Z]+\.txt"
    input:
        expand("destinationDir/{sampleName}.{suffix}", sampleName=sampleNames, allow_missing=True)
(Wrapping suffix in double brackets instead would also be a way of preserving it as the wildcard {suffix} rather than trying to expand it - your output in download_files_from_s3, as it's written now, also simplifies to
output: "destinationDir/{sampleName}.{suffix}"
where you once again can have wildcard constraints if needed. In fact, given that the issue with the file names will probably continue through your workflow, you may want to set wildcard constraints globally by adding
wildcard_constraints:
    suffix = "\.[a-zA-Z]+\.txt"
to your snakefile before defining any rules.)

Expected behavior with regular expressions with capturing-groups in pandas' `str.extract()`

I'm trying to get a grasp on regular expressions and I came across with the one included inside the str.extract method:
movies['year']=movies['title'].str.extract('.*\((.*)\).*',expand=True)
It is supposed to detect and extract whichever is in parentheses. So, if given this string: foobar (1995) it should return 1995. However, if I open a terminal and type the following
echo 'foobar (1995)' | grep '.*\((.*)\).*'
it matches the whole string instead of only the content between parentheses. I assume the method is working with the BRE flavor because of the parenthesis escaping, and so is grep (default behavior). Also, the regex tester highlights the whole string in blue and the year (the capturing group) in green. Am I missing something here? The regex works perfectly inside Python.
First of all, the behavior of Pandas .str.extract() is quite expected: it returns only the capturing group contents. The pattern used with extract requires at least 1 capturing group:
pat : string
Regular expression pattern with capturing groups
If you use a named capturing group, the new column will be named after the named group.
The grep command you provided can be reduced to
grep '\((.*)\)'
as grep is capable of matching a line partially (it does not require a full-line match) and works on a per-line basis: once a match is found, the whole line is returned. To override that behavior, you may use the -o switch.
With grep, you cannot return the capturing group contents. This can be worked around with PCRE regex support (the -P option), but it is not available on Mac, for example. sed or awk may help in those situations, too.
Try using this:
movies['year'] = movies['title'].str.extract('.*\((\d{4})\).*', expand=False)
Set expand=True if you want it to return a DataFrame, or when applying multiple capturing groups.
A year is always composed of 4 digits, so the regex \((\d{4})\) matches any year between parentheses.
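As a quick check of how the extraction behaves on a small DataFrame (a minimal sketch; the sample titles are made up):
import pandas as pd

movies = pd.DataFrame({"title": ["foobar (1995)", "the matrix (1999)"]})

# expand=False returns a Series; expand=True would return a one-column DataFrame.
movies["year"] = movies["title"].str.extract(r".*\((\d{4})\).*", expand=False)
print(movies["year"].tolist())
# -> ['1995', '1999']   (extracted as strings; convert with astype(int) if needed)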

How to speed up a search in a long document using python?

I was wondering if it is possible to search in Vim using Python in order to speed up a search in a long document.
I have a text document of 140,000 lines.
I have a list (mysearches) with 115 different search patterns.
I want to put all lines with matches in a list (hits)
This is what I do now:
for i in range(0, len(mysearches)-1)
    for line in range(1, line("$"))
        let idx = match(getline(line), mysearches[i])
        if idx >= 0
            call add(hits, line)
        endif
    endfor
endfor
"remove double linenumbers:
let unduplist = filter(copy(hits), 'index(hits, v:val, v:key+1)==-1')
The problem is that this search takes over 5 minutes.
How can I adapt above search to a python search?
How about this:
let pattern=join(mysearches, '\|')
let mylist = systemlist('grep -n "' . pattern . '" ' . shellescape(fnamemodify(@%, ':p')) . ' | cut -d: -f1')
This works by joining your patterns with \| (i.e. ORing all your different patterns together), shelling out, and letting grep process the combined pattern. grep should be pretty fast, a lot faster than Vim and possibly also faster than either Python or even Perl (this of course depends on the pattern).
The return value is a list containing all matching line numbers: since we used the -n switch of grep we receive the matching line numbers, which are in turn extracted using cut.
systemlist() returns the output split at \n, so mylist should contain the numbers of all lines matching your patterns. This of course depends on your patterns, but if you use standard BRE or ERE (-E) or even Perl RE (the -P switch) you should be okay. Depending on the flavor of RE desired, the joining part needs to be adjusted.
Note however that this is basically untested; for a really robust solution, one would probably add some more error handling (possibly preprocessing of the patterns) and split the whole thing up a little bit, so that it is easier to read.
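If you do want to do the search in Python itself, as the question asks, a rough equivalent is to combine the patterns into one alternation and scan the file once (a sketch, assuming mysearches holds plain regular-expression strings and the buffer has been written to a file, here called document.txt):
import re

# One combined pattern means the file is scanned once, not 115 times.
combined = re.compile("|".join(mysearches))

hits = []
with open("document.txt") as fh:
    for lineno, line in enumerate(fh, start=1):
        if combined.search(line):
            hits.append(lineno)
# hits contains each matching line number exactly once, in order.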
XY problem indeed.
You can use the :vimgrep command like so:
execute "vim /\\(" . join(mysearches, "\\|") . "\\)/ %"
cwindow
I just tested with the content of the 4017-line .less file I'm working on, pasted 34 times into a new 136579-line file, and a list of only 13 searches:
:let foo = ["margin", "padding", "width", "height", "bleu", "gris", "none", "auto", "background", "color", "line", "border", "overflow"]
It took 3 seconds to find the 47634 matching lines which are now conveniently listed in the quickfix window.
YMMV, of course, because the search will take more time as you add items to mysearches and make them more complex, but I'm fairly sure you'll be able to beat your current timing easily.
You could also use :grep:
execute "grep -snH " . shellescape(join(foo, '\\|')) . " %"

Why does this take so long to match? Is it a bug?

I need to match certain URLs in a web application, e.g. /123,456,789, and wrote this regex to match the pattern:
r'(\d+(,)?)+/$'
I noticed that it does not seem to finish evaluating, even after several minutes, when testing the pattern:
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523')
The expected result would be that there were no matches.
This expression, however, executes almost immediately (note the trailing slash):
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523/')
Is this a bug?
There is some catastrophic backtracking going on that will cause an exponential amount of processing depending on how long the non-matching string is. This has to do with your nested repetitions and optional comma (even though some regex engines can determine that this wouldn't be a match without attempting all of the extraneous repetition). This is solved by optimizing the expression.
The easiest way to accomplish this is to just look for 1+ digits or commas followed by a slash and the end of the string: [\d,]+/$. However, that is not perfect since it would allow for something like ,123,,4,5/.
For this you can use a slightly optimized version of your initial try: (?:\d,?)+/$. First, I made your repeating group non-capturing ((?:...)) which isn't necessary but it provides for a "cleaner match". Next, and the only crucial step, I stopped repeating the \d inside of the group since the group is already repeating. Finally, I removed the unnecessary group around the optional , since ? only affects the last character. Pretty much this will look for one digit, maybe a comma, then repeat, and finally followed by a trailing /.
This can still match an odd string 1,2,3,/, so for the heck of it I improved your original regex with a negative lookbehind: (?:\d,?)+(?<!,)/$. This will assert that there is no comma directly before the trailing /.
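A few quick checks of that final pattern, run through Python's re module (a small sketch):
import re

pattern = re.compile(r"(?:\d,?)+(?<!,)/$")

print(bool(pattern.search("123,456,789/")))  # True  - digits and commas, no comma before "/"
print(bool(pattern.search("123,456,/")))     # False - comma directly before the trailing "/"
print(bool(pattern.search("123,456,789")))   # False - no trailing "/"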
First off, I must say that this is not a bug. What your regex is doing is trying all the possibilities due to the nested repeating patterns. Sometimes this process can gobble up a lot of time and resources, and when it gets really bad, it's called catastrophic backtracking.
This is the code of findall function in python source code:
def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)
As you can see, it just uses the _compile() function, i.e. the traditional NFA engine that Python uses for its regex matching. Here is a short explanation of backtracking from Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl:
The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options, it selects one and remembers the other to return to later if need be.
Situations where it has to decide among courses of action include anything with a quantifier (decide whether to try another match), and alternation (decide which alternative to try, and which to leave for later).
Whichever course of action is attempted, if it's successful and the rest of the regex is also successful, the match is finished. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the first option, and can continue with the match by trying the other option. This way, it eventually tries all possible permutations of the regex (or at least as many as needed until a match is found).
Let's go inside your pattern. With r'(\d+(,)?)+/$' and the string '12345121,223456,123123,3234,4523,523523' we have these steps:
At first, the first part of your string (12345121) is matched with \d+, then , is matched with (,)?.
Then, based on the first step, the whole string is matched due to the + after the grouping ((\d+(,)?)+).
Then at the end, there is nothing left for /$ to match. Therefore, (\d+(,)?)+ needs to "backtrack" one character and check for /$ again. It doesn't find a proper match there either, so (,)? backtracks next, then \d+ backtracks, and this backtracking continues until every possibility has been tried and the match returns None.
So the time taken grows with the length of your string, and because of the nested quantifiers it grows very quickly.
As an approximate benchmark: in this case you have 39 characters, so you need around 3^39 backtracking attempts (there are 3 places to backtrack from).
Now, for a better understanding, I measured the runtime of the program while changing the length of the string:
'12345121,223456,123123,3234,4523,' 3^33 = 5.559060567×10¹⁵
~/Desktop $ time python ex.py
real 0m3.814s
user 0m3.818s
sys 0m0.000s
'12345121,223456,123123,3234,4523,5' 3^34 = 1.66771817×10¹⁶ #~3X before
~/Desktop $ time python ex.py
real 0m5.846s
user 0m5.837s
sys 0m0.015s
'12345121,223456,123123,3234,4523,523' 3^36= 1.500946353×10¹⁷ #~10X before
~/Desktop $ time python ex.py
real 0m15.796s
user 0m15.803s
sys 0m0.008s
So, to avoid this problem, you can use one of the ways below:
Atomic grouping (not currently supported in Python's re module; an RFE was created to add it to Python 3)
Reduce the possibility of backtracking by breaking the nested groups into separate regexes.
To avoid the catastrophic backtracking I suggest
r'\d+(,\d+)*/$'
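To see the difference yourself, you can time both patterns on one of the non-matching strings from the benchmark above (a minimal sketch; the first pattern takes a few seconds, the rewrite finishes almost instantly, and exact timings will vary by machine):
import re
import time

text = "12345121,223456,123123,3234,4523,"  # no trailing slash, so neither pattern matches

for pat in (r"(\d+(,)?)+/$",   # original pattern: catastrophic backtracking
            r"\d+(,\d+)*/$"):  # suggested rewrite: no nested quantifiers
    start = time.perf_counter()
    result = re.findall(pat, text)
    print(f"{pat!r}: {result} in {time.perf_counter() - start:.4f} s")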

Grep for specific sentence that contains []

I have a python script that reports how many times an error shows up in catalina.out within a 17 minute time period. Some errors contain more information, displayed in the next three lines beneath the error. Unfortunately the sentence I'm grepping for contains []. I don't want to do a search using regular expressions. Is there a way to turn off the regular expression function and only do an exact search?
Here is an example of a sentence im searching for:
bob: [2012-08-30 02:58:57.326] ERROR: web.errors.GrailsExceptionResolver Exception occurred when processing request: [GET] /bob/event
Thanks
(assuming you are using the standard grep command)
Is there a way to turn off the regular expression function and only do an exact search?
Sure, you can pass the -F flag to grep, like so:
grep -F "[GET]" catalina.out
Remember to put the search term in quotes, or else bash will interpret the brackets in a special way.
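If you are calling this from your Python script, the same fixed-string search via subprocess might look roughly like this (a sketch; the log file name is taken from the question, and -c is added here to count matching lines):
import subprocess

# -F: fixed-string (non-regex) search, -c: count matching lines
result = subprocess.run(
    ["grep", "-F", "-c", "[GET]", "catalina.out"],
    capture_output=True, text=True)
print(result.stdout.strip())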
If you're using bash and regular grep, you have to escape the [] chars, i.e. \[ ... \],
grep 'bob: \[2012-08-30 02:58:57.326\] ERROR: web.errors.GrailsExceptionResolver Exception occurred when processing request: \[GET\] /bob/event' catalina.out
I'm not sure if you're really asking how to search within a '17 minute time period' and/or how to display 'the next three lines beneath the error'.
It will help the answers supplied if you show sample input and sample output.
I hope this helps.
What are you searching for? If you need more than a specific exact search, you will probably need to use regular expressions.
There is no need to worry about the brackets. Regex can still search for them. You just need to escape the characters in your regex:
pattern = r'\[\d+-\d+-\d+ \d+:\d+:\d+\.\d+] ERROR:' # or whatever
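Since the surrounding script is already Python, another option is a plain substring test, which involves no regular expressions at all (a minimal sketch; the needle is just the fixed part of the example line, and the log file name is taken from the question):
# Exact, literal search - brackets have no special meaning here.
needle = "Exception occurred when processing request: [GET] /bob/event"

count = 0
with open("catalina.out") as fh:
    for line in fh:
        if needle in line:  # plain substring test, no regex involved
            count += 1
print(count)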
