Grep for specific sentence that contains [] - python

I have a python script that reports how many times an error shows up in catalina.out within a 17 minute time period. Some errors contain more information, displayed in the next three lines beneath the error. Unfortunately the sentence I'm grepping for contains []. I don't want to do a search using regular expressions. Is there a way to turn off the regular expression function and only do an exact search?
Here is an example of a sentence I'm searching for:
bob: [2012-08-30 02:58:57.326] ERROR: web.errors.GrailsExceptionResolver Exception occurred when processing request: [GET] /bob/event
Thanks

(assuming you are using the standard grep command)
Is there a way to turn off the regular expression function and only do an exact search?
Sure, you can pass the -F flag to grep, like so:
grep -F "[GET]" catalina.out
Remember to put the search term in quotes, or else bash will interpret the brackets in a special way.
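If the search runs inside the Python script itself rather than through grep, the same exact-match behavior is available without any regular expressions: the in operator does a plain substring test. The following is a minimal sketch, assuming the log has already been read into a list of lines; the function name find_literal, its context parameter, and the sample lines are made up for illustration.

```python
def find_literal(lines, needle, context=3):
    """Return (matching_line, following_lines) pairs for every line that
    contains needle verbatim. Brackets in needle have no special meaning."""
    hits = []
    for i, line in enumerate(lines):
        if needle in line:  # plain substring test, no regex involved
            hits.append((line, lines[i + 1:i + 1 + context]))
    return hits


# Made-up sample data standing in for catalina.out
sample = [
    "bob: [2012-08-30 02:58:57.326] ERROR: web.errors.GrailsExceptionResolver "
    "Exception occurred when processing request: [GET] /bob/event",
    "detail line 1",
    "detail line 2",
    "detail line 3",
    "bob: [2012-08-30 02:59:01.100] INFO: unrelated entry",
]

hits = find_literal(sample, "processing request: [GET] /bob/event")
print(len(hits))   # number of occurrences
print(hits[0][1])  # the lines beneath the first occurrence
```

Filtering by the 17 minute window would be a separate step (for example, parsing the timestamp at the start of each line) and is not covered by this sketch.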

If you're using bash and regular grep, you have to escape the [] chars, i.e. \[ ... \],
grep 'bob: \[2012-08-30 02:58:57.326\] ERROR: web.errors.GrailsExceptionResolver Exception occurred when processing request: \[GET\] /bob/event' catalina.out
I'm not sure whether you are also asking how to restrict the search to a '17 minute time period' and/or how to show the 'next three lines beneath the error.'
You will get better answers if you show sample input and sample output.
I hope this helps.

What are you searching for? If you need more than a specific exact search, you will probably need to use regular expressions.
There is no need to worry about the brackets. Regex can still search for them. You just need to escape the characters in your regex:
pattern = r'\[\d+-\d+-\d+ \d+:\d+:\d+\.\d+] ERROR:' # or whatever
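As a rough illustration of the escaping described above, the sketch below runs two patterns against the sample line from the question: one with hand-escaped brackets around the timestamp, and one built with re.escape, which escapes a literal string automatically. The variable names are made up for this example.

```python
import re

line = ("bob: [2012-08-30 02:58:57.326] ERROR: "
        "web.errors.GrailsExceptionResolver Exception occurred "
        "when processing request: [GET] /bob/event")

# Hand-escaped brackets around a timestamp of any value
timestamp_error = re.compile(r"\[\d+-\d+-\d+ \d+:\d+:\d+\.\d+\] ERROR:")

# re.escape() backslash-escapes every regex metacharacter in a literal string
literal_request = re.compile(re.escape("[GET] /bob/event"))

found_timestamp = timestamp_error.search(line) is not None
found_request = literal_request.search(line) is not None
print(found_timestamp, found_request)
```

Without escaping, "[GET]" would be read as a character class that matches a single G, E, or T, which is why the unescaped search gives surprising hits.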

Related

Why do multi-line strings lead to different pattern matches from single line strings when using python regex?

I am trying to create a Discord Bot that reads users messages and detects when an Amazon link(s) is/are present in their message.
If I use a multi-line string I capture different results from when the message is used on a single line.
Here is the code I am using:
import re
AMAZON_REGEX = re.compile("(http[s]?://[a-zA-Z0-9.-]*(?:amazon|amzn).["
"a-zA-Z]+(?:.+?(?:ref=[^?]+)|.+(?= )|[^?]+))")
def extract_url(message):
    foo = AMAZON_REGEX.findall(message)
    return foo
user_message = """https://www.amazon.co.uk/dp/B07RLWTXKG blah blah
hello
https://www.amazon.co.uk/dp/B07RLWToop foobar"""
print(extract_url(user_message))
The result of the above code is: ['https://www.amazon.co.uk/dp/B07RLWTXKG blah blah', 'https://www.amazon.co.uk/dp/B07RLWToop']
However, if I change user_message from a multiline string to a single line one then I get the following result: ['https://www.amazon.co.uk/dp/B07RLWTXKG blah blah hello https://www.amazon.co.uk/dp/B07RLWToop']
Why is this the case? Also, how do I capture just the URL without the rest of the users' messages?
It seems like you're having an issue with the exact regex you're using.
Why does the newline change the output?
After matching the link, your regex goes on to capture the words that follow it, separated by spaces, but the newline character stops it from continuing. The newline between "blah" and "hello" is what keeps "hello" from being captured in the multi-line case: by default, the . metacharacter matches any character except a newline (\n).
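The newline behavior can be seen in isolation with a much smaller pattern. This is only a sketch with a made-up two-line string, not the Amazon regex itself:

```python
import re

text = "first line\nsecond line"

# By default "." stops at a newline, so each line is matched separately.
per_line = re.findall(r".+", text)

# With re.DOTALL, "." also matches "\n" and the match spans both lines.
whole = re.findall(r".+", text, flags=re.DOTALL)

print(per_line)  # ['first line', 'second line']
print(whole)     # ['first line\nsecond line']
```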
Only capturing the link
I'm not quite sure what format the amazon link would come in, so it's difficult to say how it should look. However, you know that the link will not contain a space, so stopping the matching when you see a space character would be optimal.
(http[s]?:\/\/[a-zA-Z0-9.-]*(?:amazon|amzn).[a-zA-Z]+(?:.+?(?:ref=[^?]+)|.+(?= )|[^?]+))
(http[s]?:\/\/[a-zA-Z0-9.-]*(?:amazon|amzn).[a-zA-Z]+(?:.+?(?:ref=[^?]+)|[^ ]+(?= )|[^?]+))
In the second pattern above, I turned the . in your .+(?= ) alternative (basically "match all characters") into [^ ] (basically "match all except for a space"). This means the match stops at the first space after the link instead of running on into the words that follow.
Good luck with the Discord bot!
So the reason you're getting a different result between your two different input sources is because you're not doing any checks for the presence of new lines in your regex. This answer goes into a little more detail about how your regex might need to be modified to detect a newline string.
But - if what you really want is just to get a list of links without the rest of the text, you're better off using a different regex string designed to capture just the URL. This post has several different regex strategies for matching just a single URL.
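One possible shape for such a URL-only pattern is sketched below. It is an assumption about what an Amazon link looks like (a host containing "amazon" or "amzn", then everything up to the next whitespace), not a pattern taken from the linked post, and the names AMAZON_URL and extract_urls are made up here.

```python
import re

# Assumption: an Amazon link is "http(s)://", a host containing "amazon" or
# "amzn", a dot, and then everything up to the next whitespace character.
AMAZON_URL = re.compile(r"https?://[a-zA-Z0-9.-]*(?:amazon|amzn)\.\S+")


def extract_urls(message):
    """Return every Amazon-looking URL in message, without surrounding text."""
    return AMAZON_URL.findall(message)


user_message = """https://www.amazon.co.uk/dp/B07RLWTXKG blah blah
hello
https://www.amazon.co.uk/dp/B07RLWToop foobar"""

urls = extract_urls(user_message)
print(urls)
```

Because \S never matches a space or a newline, the result is the same whether the message is on one line or several. A link followed directly by punctuation (such as a trailing comma) would still pick up that character, so real messages may need extra trimming.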

regex for email parsing in python

I've been asked to write a regular expression that can catch multi-domain email addresses and to implement it in Python. I came up with the following regular expression (and code; the emphasis is on the regex, though), which I think is correct:
import re
regex = r'\b[\w|\.|-]+#([\w]+\.)+\w{2,4}\b'
input_string = "hey my mail is abc#def.ghi"
match = re.findall(regex, input_string)
print(match)
Now when I run this (using a very simple mail address) it doesn't catch it!
Instead it shows an empty list as the output. Can somebody tell me where I went wrong in the regular expression literal?
Here's a simple one to start you off with
regex = r'\b[\w.-]+?#\w+?\.\w+?\b'
re.findall(regex,input_string) # ['abc#def.ghi']
One problem with your original is that you don't need the | operator inside a character class ([..]); there it is just a literal | character. Just write [\w|\.|-] as [\w.-] (if the - is at the end, you don't need to escape it).
Next, there are way too many variations on legitimate domain names. Just look for at least one period surrounded by word characters after the # symbol:
#\w+?\.\w+?\b
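A separate detail that may matter when adapting the original pattern: when a pattern contains a capturing group, re.findall returns the text of that group rather than the whole match. The sketch below keeps the # separator used in this thread and compares a capturing group with a non-capturing one; the variable names are made up.

```python
import re

input_string = "hey my mail is abc#def.ghi"

# Capturing group: findall() returns what the group last captured.
with_group = re.findall(r"\b[\w.-]+#(\w+\.)+\w{2,4}\b", input_string)

# Non-capturing group (?:...): findall() returns the whole match.
without_group = re.findall(r"\b[\w.-]+#(?:\w+\.)+\w{2,4}\b", input_string)

print(with_group)     # ['def.']
print(without_group)  # ['abc#def.ghi']
```

The (?:...) form keeps the repetition needed for multi-part domains while still letting findall report the full address.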

Using python to find specific pattern contained in a paragraph

I'm trying to use python to go through a file, find a specific piece of information and then print it to the terminal. The information I'm looking for is contained in a block that looks something like this:
\\Version=EM64L-G09RevD.01\State=1-A1\HF=-1159.6991675\RMSD=4.915e-11\RMSF=1.175e-07\ZeroPoint=0.0353317\
I would like to be able to get the information HF=-1159.6991675. More generally, I would like the script to copy and print \HF=WhateverTheNumberIs\
I've managed to make scripts that are able to copy an entire line and print it out to the terminal, but I am unsure how to accomplish this particular task.
My suggestion is to use regular expressions (regex) in order to catch the required pattern:
import re  # for using regular expressions
s = open("<filename here>").read()  # read the content of the file and hold it as a string to be scanned
p = re.compile(r"\\HF=[^\\]+")  # the pattern as you described: starts with \HF= and runs up to (not including) the next \
print(p.findall(s))  # finds all occurrences and prints them
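To try the idea without a file, here is a small sketch that runs against the sample block from the question and captures only the value after HF=. The variable names are made up, and the capture group is an addition for illustration.

```python
import re

# Sample block from the question; a raw string cannot end in a backslash,
# so the trailing one is appended separately.
block = (r"\\Version=EM64L-G09RevD.01\State=1-A1\HF=-1159.6991675"
         r"\RMSD=4.915e-11\RMSF=1.175e-07\ZeroPoint=0.0353317" + "\\")

# \\HF= matches a literal backslash followed by "HF=";
# ([^\\]+) captures everything up to the next backslash.
hf_pattern = re.compile(r"\\HF=([^\\]+)")

hf_values = hf_pattern.findall(block)
print(hf_values)  # ['-1159.6991675']
```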
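Regular expressions are the answer, something like r'\\HF=[^\\]+' (the fields in your block are separated by backslashes, so the pattern has to match those rather than forward slashes).
Tutorial: regex tutorial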
Once you have learned regex, it is an indispensable resource.

How to combine multiple regular expressions into one line?

My script works fine doing this:
images = re.findall(r"src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall(r"\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)
However, I believe it is inefficient to search through the whole document twice.
Here's a sample document if it helps: http://pastebin.com/5kRZXjij
I would expect the following output from the above:
images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl
Instead it would be better to do something like:
image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)
How can I combine the two re.findall lines into one?
I have tried using the | character but I always fail to match anything. So I'm sure I'm completely confused as to how to use it properly.
As mentioned in the comments, a pipe (|) should do the trick.
The regular expression
(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))
catches either of the two patterns.
Demo on Regex Tester
If you really want efficiency...
For starters, I would cut out the \S*? in the second regex. It serves no purpose apart from an opportunity for lots of backtracking.
src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
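With two capture groups in one alternation, re.findall returns a tuple per match, with an empty string for whichever group did not take part. The following sketch shows one way to split those tuples back into images and videos in a single pass; the doc snippet is made up (built from the URLs in the question), and the name COMBINED is only for illustration.

```python
import re

# The two alternatives from above, each with its own capture group.
COMBINED = re.compile(
    r'src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)'
    r'|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)'
)

# Made-up snippet standing in for the real document.
doc = ('<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg"> '
       '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl">')

images, videos = [], []
for image, video in COMBINED.findall(doc):
    # Exactly one of the two groups is non-empty for each match.
    if image:
        images.append(image)
    else:
        videos.append(video)

print(images)
print(videos)
```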
Other ideas
You can get rid of the capture groups by using a small lookbehind in the first one, allowing you to get rid of all parentheses and directly matching what you want. Not faster, but tidier:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
You can use the re.IGNORECASE option and get rid of some letters:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*

Python Regex working different depending on the implementation?

I'm working on a file parser that needs to cut out comments from JavaScript code. It has to be smart enough not to treat a '//' sequence inside a string as the beginning of a comment. I have the following idea for doing it:
Iterate through lines.
Find the '//' sequence first, then find all strings surrounded by quotes (' or ") in the line, and then iterate through all the string matches to check whether the '//' sequence is inside or outside one of those strings. If it is outside all of them, it must be the beginning of a proper comment.
When testing the code on the following line (part of a bigger JS file, of course):
document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";
I encountered a problem. My regular expression code:
re_strings=re.compile(""" "
(?:
\\.|
[^\\"]
)*
"
|
'
(?:
[^\\']|
\\.
)*
'
""",re.VERBOSE);
for s in re.finditer(re_strings,line):
    print(s.group(0))
In Python 3.2.3 (and 3.1.4) this returns the following strings:
"URL_LABEL"
"<a name=\"
" href=\"
"+url+"
" target=\"
">"
"</a>"
This is obviously wrong, because \" should not end the string. I've been debugging my regex for quite a long time and it SHOULDN'T exit here. So I used RegexBuddy (with Python compatibility) and the Python regex tester at http://re-try.appspot.com/ for reference.
The most peculiar thing is that they both return the same, correct results, which differ from my code's output:
"URL_LABEL"
"<a name=\"link\" href=\"http://"
"\" target=\"blank\">"
"</a>"
My question is: what is the cause of those differences? What have I overlooked? I'm rather a beginner in both Python and regular expressions, so maybe the answer is simple...
P.S. I know that checking whether the '//' sequence is inside string quotes can be accomplished with one bigger regex. I've already tried that and met the same problem.
P.P.S. I would like to know what I'm doing wrong and why my code behaves differently from the regex test applications, not to find other ideas for how to parse JavaScript code.
You just need to use a raw string to create the regex:
re_strings=re.compile(r""" "
etc.
"
""",re.VERBOSE);
The way you've got it, \\.|[^\\"] becomes the regex \.|[^\"], which matches a literal dot (.) or anything that's not a quotation mark ("). Add the r prefix to the string literal and it works as you intended.
See the demo here. (I also used a raw string to make sure the backslashes appeared in the target string. I don't know how you arranged that in your tests, but the backslashes obviously are present; the problem is that they're missing from your regex.)
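The effect of the r prefix can be checked directly. The sketch below first compares what Python stores for the pattern fragment with and without the prefix, then runs a raw-string version of the regex against the line from the question; the layout of the verbose pattern and the variable names are my own.

```python
import re

# What Python stores for the fragment without and with the r prefix
non_raw_fragment = """\\.|[^\\"]"""   # backslashes collapse to: \.|[^\"]
raw_fragment = r"""\\.|[^\\"]"""      # backslashes are kept:    \\.|[^\\"]

line = r'''document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";'''

re_strings = re.compile(r'''
    " (?: \\. | [^\\"] )* "   # double-quoted string with escapes
  | ' (?: \\. | [^\\'] )* '   # single-quoted string with escapes
''', re.VERBOSE)

matches = [m.group(0) for m in re_strings.finditer(line)]
for s in matches:
    print(s)
```

With the raw string, the four matches line up with what the regex testers reported in the question.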
You cannot deal with matching quotes with regex... in fact, you cannot guarantee any matching pairs of anything (and nested pairs especially). You need a more sophisticated state machine for that (LLVM, etc.).
Source: lots of CS classes...
Also see: Matching pair tag with regex, for a more detailed explanation.
I know it's not what you wanted to hear, but it's basically just the way it is. And yes, different implementations of regex can return different results for stuff that regex can't really do.
