I actually have:
regex = r'\bon the\b'
but need my regex to match only if this keyword (actually "on the") is not between parentheses in the text:
should match:
john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
should not match:
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)
I don't think that a regex would help you here in the general case.
For your examples, this regex would work as you want it to:
(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])
description:
(?<=[^\(\)].{3}) Positive Lookbehind - Assert that the regex below
can be matched
[^\(\)] match a single character not present in the list below
\( matches the character ( literally
\) matches the character ) literally
.{3} matches any character (except newline)
Quantifier: Exactly 3 times
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
on the matches the characters on the literally (case sensitive)
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
(?=.{3}[^\(\)]) Positive Lookahead - Assert that the regex below
can be matched
.{3} matches any character (except newline)
Quantifier: Exactly 3 times
[^\(\)] match a single character not present in the list below
\( matches the character ( literally
\) matches the character ) literally
If you want to generalize the problem to a string of any length between the parentheses and the string you are searching for, this regex will not work.
The issue is the length of the string between the parentheses and your string: in Python's re module a lookbehind must have a fixed width, so it cannot contain open-ended quantifiers such as * or +.
In my regex I used a positive lookahead and a positive lookbehind. The same result could be achieved with negative ones, but the issue remains.
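The fixed-width restriction is easy to reproduce. The following is a minimal sketch, assuming the standard re module; the second pattern is made up for illustration and is not part of the answer above:

```python
import re

# A lookbehind with a fixed width (here 4 characters) compiles fine.
fixed = re.compile(r'(?<=[^()].{3})\bon the\b')

# A lookbehind containing an open-ended quantifier is rejected by re.
try:
    re.compile(r'(?<=\(.*)on the')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("re.error:", exc)
```

This is why the pattern above hard-codes `.{3}` and cannot adapt to an arbitrary distance between the parenthesis and the keyword.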
Suggestion: write a small piece of Python code that checks, for each line, whether it contains your text outside parentheses, as a regex alone can't do the job.
Example:
import re
mystr = 'on the'
data = '\n'.join(mylist)  # the whole text as one string
# the unwanted occurrences, which are easy to define with a regex
unwanted = re.findall(r'\(.*' + re.escape(mystr) + r'.*\)|\)' + re.escape(mystr), data)
# drop the lines containing an unwanted occurrence
# (build a new list instead of removing items while iterating)
mylist = [line for line in mylist if not any(item in line for item in unwanted)]
# look for what you want
for line in mylist:
    if mystr in line:
        print(line)
where:
mylist: a list containing all the lines you want to search through.
mystr: the string you want to find.
Hope this helped.
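In UNIX, the grep utility with the following regular expressions will be sufficient (in grep's default basic syntax an unescaped parenthesis is a literal character, while \( and \) would start and end a group):
grep " on the " input_file_name | grep -v "(.* on the .*)"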
How about something like this: ^(.*)(?:\(.*\))(.*)$ see it in action.
As you requested, it "matches only words that are not between parentheses in the text"
So, from:
some text (more text in parentheses) and some not in parentheses
Matches: some text + and some not in parentheses
More examples at the link above.
EDIT: changing answer since the question was changed.
To capture all mentions not within parentheses I'd use some code instead of a huge regex.
Something like this will get you close:
import re
pattern = r"(on the)"
test_text = '''john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)'''
match_list = test_text.split('\n')
for line in match_list:
    print(line, "->", end=" ")
    bracket_pattern = r"(\(.*\))"  # remove everything between ()
    brackets = re.findall(bracket_pattern, line)
    for match in brackets:
        line = line.replace(match, "")
    matches = re.findall(pattern, line)
    for match in matches:
        print(match, end="")
    print()
Output:
john is on the beach -> on the
let me put this on the fridge -> on the
he (my son) is on the beach -> on the
arnold is on the road (to home) -> on the
(my son is )on the beach -> on the (this is the only one that doesn't work)
john is at the beach ->
bob is at the pool (berkeley) ->
the spon (is on the table) ->
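One possible way to also handle the ")on the" case is sketched below. It is not taken from the answers above: the helper name find_outside_parens is hypothetical, and the sketch assumes that parentheses are never nested. Each parenthesised group is collapsed to a bare ")" so that a keyword glued to it can be rejected with a fixed-width negative lookbehind.

```python
import re

def find_outside_parens(keyword, line):
    """Return keyword matches that are neither inside (...) nor glued to a ')'."""
    # Assumption: parentheses are not nested. Each (...) group is collapsed
    # to a bare ')' so that its position stays visible in the line.
    collapsed = re.sub(r'\([^()]*\)', ')', line)
    # The negative lookbehind rejects a keyword that directly follows ')'.
    pattern = r'(?<!\))\b' + re.escape(keyword) + r'\b'
    return re.findall(pattern, collapsed)

lines = [
    'john is on the beach',
    'he (my son) is on the beach',
    '(my son is )on the beach',
    'the spon (is on the table)',
]
for line in lines:
    print(line, '->', find_outside_parens('on the', line))
```

With the sample lines from the question, the first two lines produce a match and the last two do not.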
I have a text like this
EXPRESS blood| muscle| testis| normal| tumor| fetus| adult
RESTR_EXPR soft tissue/muscle tissue tumor
Right now I want to only extract the last item in EXPRESS line, which is adult.
My pattern is:
[|](.*?)\n
The match goes greedy and captures muscle| testis| normal| tumor| fetus| adult. Is there any way to solve this issue?
You can match a pipe char followed by optional spaces, then capture the chars that follow while excluding pipe chars, and take the capture group value.
If there has to be a newline at the end of the string:
\|[^\S\n]*([^|\n]*)\n
Explanation
\| Match |
[^\S\n]* Match optional whitespace chars without newlines
( Capture group 1
[^|\n]* Match optional chars except for | or a newline
) Close group 1
\n Match a newline
Regex demo
Or asserting the end of the string:
\|[^\S\n]*([^|\n]*)$
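Both patterns can be tried from Python. This is a minimal sketch using the sample text from the question; the variable names are made up for illustration:

```python
import re

text = ("EXPRESS blood| muscle| testis| normal| tumor| fetus| adult\n"
        "RESTR_EXPR soft tissue/muscle tissue tumor\n")

# Variant that requires a newline after the last item.
with_newline = re.search(r"\|[^\S\n]*([^|\n]*)\n", text)
print(with_newline.group(1))

# Variant that asserts the end of the string, applied to a single line.
line = "EXPRESS blood| muscle| testis| normal| tumor| fetus| adult"
at_end = re.search(r"\|[^\S\n]*([^|\n]*)$", line)
print(at_end.group(1))
```

Because `[^|\n]*` cannot cross a pipe, every attempt that starts at an earlier pipe fails, and only the last pipe on the line yields a match.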
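You could use this one. It spares you the space before the value, handles the \r\n case and is non-greedy (the quantifier sits inside the group, so the group captures the whole value and not just its last character):
\|\s*([^\|]*?)\r?\n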
Tested here
I am trying to search a string in python using regex for a particular word that begins with a space and ends with a space after it. The string in question that I want to search is;
JAKARTA, INDONESIA (1 February 2017)
and I want to get back the ", INDONESIA (" part so I can apply rtrim and ltrim to it. The country could also be something like United Kingdom.
I have attempted to write this code within my python code;
import re
text = "JAKARTA, INDONESIA (1 February 2017)"
countryRegex = re.compile(r'^(,)(\s)([a-zA-Z]+)(\s)(\()$')
mo = countryRegex.search(text)
print(mo.group())
However this prints out the result
AttributeError: 'NoneType' object has no attribute 'group'
This indicates to me that I am not getting any match object back.
I then attempted to use my regex in regex 101 however it still returns an error here saying "Your regular expression does not match the subject string."
I assumed this would work as I test for literal comma (,) then a space (\s), then one or more letters ([a-zA-Z]+), then another space (\s) and then finally an opening bracket making sure I have escaped it (\(). Is there something wrong with my regex?
You can try using this regex instead, with a lookbehind and a lookahead so that it only matches the country part.
Adding a space to the character class lets you match countries like United Kingdom.
(?<=, )([a-zA-Z ]+)(?= \()
Test on Regex101
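A short sketch of how the lookaround pattern behaves in Python. The second input string is a made-up example to show a two-word country:

```python
import re

country_regex = re.compile(r"(?<=, )([a-zA-Z ]+)(?= \()")

one_word = country_regex.search("JAKARTA, INDONESIA (1 February 2017)")
two_words = country_regex.search("LONDON, United Kingdom (1 February 2017)")

# The lookarounds are not part of the match, so no trimming is needed.
print(one_word.group())
print(two_words.group())
```

Since the comma, the spaces and the opening parenthesis are only asserted and not consumed, the match itself is already the bare country name.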
Once you remove the anchors (^ matches the start of string position and $ matches the end of string position), the regex will match the string. However, you may get INDONESIA with a capturing group using:
,\s*([a-zA-Z]+)\s*\(
See the regex demo. match.group(1) will contain the value.
Details:
,\s* - a comma and zero or more whitespaces (replace * with + if you want at least 1 whitespace to be present)
([a-zA-Z]+) - capturing group 1 matching one or more ASCII letters
\s* - zero or more whitespaces
\( - a ( literal symbol.
Sample Python code:
import re
text = "JAKARTA, INDONESIA (1 February 2017)"
countryRegex = re.compile(r',\s*([a-zA-Z]+)\s*\(')
mo = countryRegex.search(text)
if mo:
    print(mo.group(1))
An alternative regex that would capture anything between ,+whitespace and whitespace+( is
,\s*([^)]+?)\s*\(
See this regex demo. Here, [^)]+? matches 1+ chars other than ) as few as possible.
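The difference between the two capturing patterns shows up with a two-word country. A minimal sketch, with a made-up second input string:

```python
import re

letters_only = re.compile(r",\s*([a-zA-Z]+)\s*\(")
anything = re.compile(r",\s*([^)]+?)\s*\(")

text = "LONDON, United Kingdom (1 February 2017)"

# [a-zA-Z]+ cannot cross the space inside "United Kingdom", so there is no match.
print(letters_only.search(text))

# The lazy [^)]+? grows until optional whitespace and "(" can follow.
print(anything.search(text).group(1))
```

For the single-word input from the question, both patterns capture INDONESIA in group 1.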
I'm trying to find an expression "K others" in a sentence "Chris and 34K others"
I tried with regular expression, but it doesn't work :(
import re
value = "Chris and 34K others"
m = re.search("(.K.others.)", value)
if m:
    print("it is true")
else:
    print("it is not")
Guessing that you're web-page scraping "you and 34k others liked this on Facebook", and you're wrapping "K others" in a capture group, I'll jump straight to how to get the number:
import re
value = "Chris and 34K others blah blah"
# regex describes
# a leading space, one or more non-space characters (to catch punctuation),
# an optional space, then a trailing 'K others' in any capitalisation
m = re.search(r"\s(\S+?)\s*K others", value, re.IGNORECASE)
if m:
    captured_values = m.groups()
    print("Number of others:", captured_values[0], "K")
else:
    print("it is not")
Try this code on repl.it
This should also cover uppercase/lowercase K, numbers with commas (1,100K people), spaces between the number and the K, and work if there's text after 'others' or if there isn't.
You should use search rather than match unless you expect your regular expression to match at the beginning. The help string for re.match mentions that the pattern is applied at the start of the string.
If you want to match something within the string, use re.search; re.match only matches at the beginning. Also, change your RegEx to (K.others): the last . ruins the RegEx because there is nothing after "others" for it to match, and the first . only matches the character before the K. I removed both:
>>> bool(re.search("(K.others)", "Chris and 34K others"))
True
The RegEx (K.others) matches:
Chris and 34K others
            ^^^^^^^^
As opposed to (.K.others.), which matches nothing. You can use (.K.others) as well, which also matches the character before:
Chris and 34K others
           ^^^^^^^^^
Also, you can use \s instead of . so that only a whitespace character is accepted between the words: (K\sothers). This will literally match K, a whitespace character, and others.
Now, if you want to match all preceding and all following, try: (.+)?(K\sothers)(\s.+)?. Here's a link to repl.it. You can get the number with this.
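The points above can be checked in a few lines. This is a minimal sketch; the variable names are made up for illustration:

```python
import re

value = "Chris and 34K others"

# re.match only tries the start of the string, so it finds nothing here.
at_start = re.match(r"K.others", value)

# re.search scans the whole string.
anywhere = re.search(r"K.others", value)

# The trailing . needs a character after "others", and there is none.
trailing_dot = re.search(r"(.K.others.)", value)

# Optional groups around the phrase capture what precedes and follows it.
m = re.search(r"(.+)?(K\sothers)(\s.+)?", value)
print(m.groups())
```

Group 1 of the last pattern holds everything before the K, including the number, and group 3 is None because nothing follows "others".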
I want to use re.MULTILINE but NOT re.DOTALL, so that I can have a regex that includes both an "any character" wildcard and the normal . wildcard that doesn't match newlines.
Is there a way to do this? What should I use to match any character in those instances that I want to include newlines?
To match a newline, or "any symbol" without re.S/re.DOTALL, you may use any of the following:
(?s). - the inline modifier group with s flag on sets a scope where all . patterns match any char including line break chars
Any of the following work-arounds:
[\s\S]
[\w\W]
[\d\D]
The main idea is that the opposite shorthand classes inside a character class match any symbol there is in the input string.
Comparing it to (.|\s) and other variations with alternation, the character class solution is much more efficient as it involves much less backtracking (when used with a * or + quantifier). Compare the small example: it takes (?:.|\n)+ 45 steps to complete, and it takes [\s\S]+ just 2 steps.
See a Python demo where I am matching a line starting with 123 and up to the first occurrence of 3 at the start of a line and including the rest of that line:
import re
text = """abc
123
def
356
more text..."""
print( re.findall(r"^123(?s:.*?)^3.*", text, re.M) )
# => ['123\ndef\n356']
print( re.findall(r"^123[\w\W]*?^3.*", text, re.M) )
# => ['123\ndef\n356']
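For a smaller comparison of the plain dot with the work-arounds, here is a minimal sketch on a made-up two-line string, with no flags passed at all:

```python
import re

text = "first line\nsecond line"

# A plain dot stops at the newline, so the match stays on the first line.
dot = re.search(r"first.*line", text).group()

# [\s\S] matches any character, newlines included.
char_class = re.search(r"first[\s\S]*line", text).group()

# (?s:...) turns on DOTALL only inside the group.
scoped = re.search(r"first(?s:.*)line", text).group()

print(repr(dot), repr(char_class), repr(scoped))
```

The scoped form `(?s:...)` needs Python 3.6 or later; the character class works in any version.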
Match any character (including new line):
Regular Expression: (Note the use of space ' ' is also there)
[\S\n\t\v ]
Example:
import re
text = 'abc def ###A quick brown fox.\nIt jumps over the lazy dog### ghi jkl'
# We want to extract "A quick brown fox.\nIt jumps over the lazy dog"
matches = re.findall(r'###([\S\n ]+)###', text)  # the group keeps the ### markers out of the result
print(matches[0])
The 'matches[0]' will contain:
'A quick brown fox.\nIt jumps over the lazy dog'
Description of '\S' Python docs:
\S
Matches any character which is not a whitespace character.
( See: https://docs.python.org/3/library/re.html#regular-expression-syntax )
I am trying to understand regex in Python. How can I split the following sentence with a regular expression?
"familyname, Givenname A.15.10"
This is like the phonebook example in the Python regex documentation, http://docs.python.org/library/re.html. A person may have 2 or more family names and 2 or more given names. The family names are followed by ', ' and the given names by ' '. The last item is the person's office. What I did until now is
import re
file=open('file.txt','r')
data=file.readlines()
for i in range(90):
    person=re.split('[,\.]',data[i],maxsplit=2)
    print(person)
it gives me a result like this
['Wegner', ' Sven Ake G', '15.10\n']
I want to have something like
['Wegner', ' Sven Ake', 'G', '15', '10']. Any idea?
In the regex world it's often easier to "match" rather than "split". When you're "matching" you tell the RE engine directly what kinds of substrings you're looking for, instead of concentrating on separating characters. The requirements in your question are a bit unclear, but let's assume that
"surname" is everything before the first comma
"name" is everything before the "office"
"office" consists of non-space characters at the end of the string
This translates to regex language like this:
rr = r"""
^ # begin
([^,]+) # match everything but a comma
(.+?) # match everything, until next match occurs
(\S+) # non-space characters
$ # end
"""
Testing:
import re
rr = re.compile(rr, re.VERBOSE)
print(rr.findall("de Batz de Castelmore d'Artagnan, Charles Ogier W.12.345"))
# [("de Batz de Castelmore d'Artagnan", ', Charles Ogier ', 'W.12.345')]
Update:
rr = r"""
^ # begin
([^,]+) # match everything but a comma
[,\s]+ # a comma and spaces
(.+?) # match everything until the next match
\s* # spaces
([A-Z]) # an uppercase letter
\. # a dot
(\d+) # some digits
\. # a dot
(\d+) # some digits
\s* # maybe some spaces or newlines
$ # end
"""
import re
rr = re.compile(rr, re.VERBOSE)
s = 'Wegner, Sven Ake G.15.10\n'
print(rr.findall(s))
# [('Wegner', 'Sven Ake', 'G', '15', '10')]
What you want to do is first split the family name by ,
familyname, rest = text.split(',', 1)
Then you want to split the office with the first space from the right.
givenname, office = rest.rsplit(' ', 1)
Assuming that family names don't contain a comma, you can extract them easily. Given names are sensitive to dots. For example:
Harney, PJ A.15.10
Harvey, P.J. A.15.10
This means that you should probably cut the office off the rest of the record (once the family name is out) with a pattern anchored at the end (a regex of the form "maskpattern$").
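Put together, the split-based approach could look like the sketch below. The function name parse_record is hypothetical, and the office format (an uppercase letter, a dot, digits, a dot, digits) is an assumption based on the examples in the question:

```python
import re

def parse_record(text):
    """Split 'familyname, givenname office' into its parts."""
    # Everything before the first comma is the family name.
    familyname, rest = text.split(',', 1)
    # The office is the last space-separated token of what remains.
    givenname, office = rest.strip().rsplit(' ', 1)
    # Assumed office format: letter.digits.digits, e.g. A.15.10
    m = re.fullmatch(r'([A-Z])\.(\d+)\.(\d+)', office)
    if m is None:
        raise ValueError('unexpected office format: %r' % office)
    return (familyname, givenname) + m.groups()

print(parse_record('Wegner, Sven Ake G.15.10\n'))
print(parse_record('Harvey, P.J. A.15.10'))
```

Because the office is cut off from the right before any dots are examined, dots inside the given names (as in P.J.) are left alone.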