Remove Wikipedia link brackets - python

I want to strip the link brackets from Wikipedia markup.
Here's an easy case.
He is [[Korean]].
I can remove the brackets.
Here's a more difficult one.
He lives in [[Gimhae city|Gimhae]].
The first part (Gimhae city) is the Wikipedia article title.
So I need to keep the second part inside the brackets.
Any suggestions are welcome.

You can use the following regex:
\[{2}(?:[^|\]]*\|)?([^]]*)]{2}
And replace with \1.
See demo
Here is what the regex matches:
\[{2} - 2 opening square brackets
(?:[^|\]]*\|)? - 0 or 1 sequence of characters other than | and ] (with [^|\]]*) and a literal | with \| (note it is escaped outside of character class)
([^]]*) - matches and captures into Group 1 (referenced later with \1) 0 or more characters other than a closing square bracket
]{2} - 2 closing square brackets (note we do not have to escape them here, since a ] outside a character class is treated as a literal).
The Python snippet:
import re
p = re.compile(r'\[{2}(?:[^|\]]*\|)?([^]]*)]{2}')
test_str = "He lives in [[Gimhae city|Gimhae]]. He lives in [[Gimhae]]. "
result = re.sub(p, r"\1", test_str)
print(result) # => He lives in Gimhae. He lives in Gimhae.
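To check the pattern against both cases from the question, here is a minimal sketch that wraps the substitution in a helper. The names WIKI_LINK and strip_wiki_links are hypothetical, chosen only for illustration.

```python
import re

# Same pattern as above: optional "title|" prefix, then the captured label.
WIKI_LINK = re.compile(r'\[{2}(?:[^|\]]*\|)?([^]]*)]{2}')

def strip_wiki_links(text):
    """Replace [[label]] and [[title|label]] with just the label."""
    return WIKI_LINK.sub(r'\1', text)

print(strip_wiki_links("He is [[Korean]]."))                    # He is Korean.
print(strip_wiki_links("He lives in [[Gimhae city|Gimhae]]."))  # He lives in Gimhae.
```

Text without any [[...]] markup passes through unchanged, since the pattern simply finds nothing to replace.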

Related

Regex to fix (all the matches or none) at the end to one

I'm trying to collapse the trailing .'s of a string into exactly one. For example,
line = "python...is...fun..."
I have the regex \.*$ in Ruby, whose match is to be replaced by a single ., as in this demo, but it doesn't seem to work as expected. I've searched for similar posts, and the closest I've found is this answer in Python, which suggests the following:
>>> import re
>>> text1 = 'python...is...fun...'
>>> new_text = re.sub(r"\.+$", ".", text1)
>>> new_text
'python...is...fun.'
But it fails if there is no . at the end. So I tried \b\.*$, as seen here, but this fails on the 3rd test, which has some ?'s at the end.
My question is: why doesn't \.*$ match all the .'s (despite being greedy), and how do I solve the problem correctly?
Expected output:
python...is...fun.
python...is...fun.
python...is...fun??.
You might use an alternation that either matches 2 or more dots, or asserts that what is directly to the left is not a dot.
In the replacement, use a single dot.
(?:\.{2,}|(?<!\.))$
Explanation
(?: Non capture group for the alternation
\.{2,} Match 2 or more dots
| Or
(?<!\.) Get the position where directly to the left is not a . (which you can extend with other characters as desired)
) Close non capture group
$ End of string (Or use \Z if there can be no newline following)
Regex demo | Python demo
For example
import re
strings = [
    "python...is...fun...",
    "python...is...fun",
    "python...is...fun??"
]
for s in strings:
    new_text = re.sub(r"(?:\.{2,}|(?<!\.))$", ".", s)
    print(new_text)
Output
python...is...fun.
python...is...fun.
python...is...fun??.
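As for why the original \.*$ attempt does not work: the pattern can also match the empty string, so after consuming the trailing dots it matches once more, with zero width, at the very end, and both matches get replaced. A small sketch of this in Python (assuming Python 3.7 or later, where re.sub also replaces an empty match adjacent to a previous match):

```python
import re

line = "python...is...fun..."

# \.*$ matches twice: the run of dots, then an empty match at the end.
spans = [m.span() for m in re.finditer(r"\.*$", line)]
print(spans)  # [(17, 20), (20, 20)]

# Each match is replaced by a dot, so two dots are left behind.
fixed = re.sub(r"\.*$", ".", line)
print(fixed)  # python...is...fun..
```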
If an empty string should not be replaced by a dot, you can use a positive lookbehind.
(?:\.{2,}|(?<=[^\s.]))$
Regex demo
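The difference between the two patterns only shows on input that has nothing to attach a dot to. A short comparison, assuming the same single-dot replacement as above:

```python
import re

# Matches at the end unless the previous character is a dot.
negative = r"(?:\.{2,}|(?<!\.))$"
# Additionally requires a non-whitespace, non-dot character before the end.
positive = r"(?:\.{2,}|(?<=[^\s.]))$"

empty_negative = re.sub(negative, ".", "")
empty_positive = re.sub(positive, ".", "")
print(repr(empty_negative))  # '.'
print(repr(empty_positive))  # ''

# On ordinary input both patterns behave the same.
print(re.sub(negative, ".", "python...is...fun"))  # python...is...fun.
print(re.sub(positive, ".", "python...is...fun"))  # python...is...fun.
```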

Regex : matching integers inside of brackets

I am trying to strip bracketed suffixes from strings such as version = 10.9.8[35]. I want to substitute the bracketed integer (all of [35], including the brackets) with an empty string using the regex [\[+0-9*\]+], but this also matches numbers not surrounded by brackets. Am I not using the + quantifier properly?
You could match the format of the version number and then match one or more digits between square brackets.
In the replacement, use the first capturing group, r'\1'.
\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]
\b Word boundary
( Capture group 1
[0-9]+ Match 1+ digits
(?:\.[0-9]+)+ Match a . and 1+ digits and repeat that 1 or more times
) Close group
\[[0-9]+\] Match 1+ digits between square brackets
Regex demo
For example
import re
regex = r"\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]"
test_str = "version = 10.9.8[35]"
result = re.sub(regex, r'\1', test_str)
print (result)
Output
version = 10.9.8
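As for the original attempt: the outer square brackets in [\[+0-9*\]+] turn the whole pattern into a single character class, so + and * are literal members of the set rather than quantifiers, and the pattern matches any one bracket, digit, + or *. A quick check:

```python
import re

test_str = "version = 10.9.8[35]"

# The whole pattern is one character class, so each match is a single character.
single_chars = re.findall(r"[\[+0-9*\]+]", test_str)
print(single_chars)  # ['1', '0', '9', '8', '[', '3', '5', ']']
```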
No need for regex
s = '10.9.8[35]'
t = s[:s.rfind("[")]
print(t)
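One caveat with the slicing approach: str.rfind returns -1 when the string has no [, and s[:-1] would then silently drop the last character. A minimal guarded sketch (strip_bracket_suffix is a hypothetical helper name):

```python
def strip_bracket_suffix(s):
    """Cut the string at its last "[" and leave it untouched if there is none."""
    idx = s.rfind("[")
    return s if idx == -1 else s[:idx]

print(strip_bracket_suffix("10.9.8[35]"))  # 10.9.8
print(strip_bracket_suffix("10.9.8"))      # 10.9.8
```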
But if you insist ;-)
import re
s = '10.9.8[35]'
t = re.sub(r"^(.*?)[[]\d+[]]$", r"\1", s)
print(t)
Breakdown of regex:
^ - begins with
() - Capture Group 1 you want to keep
.*? - Any number of chars (non-greedy)
[[] - an opening [
\d+ 1+ digit
[]] - closing ]
$ - ends with
\1 - capture group 1 - used in replace part of regex replace. The bit you want to keep.
Output in both cases:
10.9.8
Use regex101.com to familiarise yourself more. If you click on any of the regex samples at bottom right of the website, it will give you more info. You can also use it to generate regex code in a variety of languages too. (not good for Java though!).
There's also a great series of Python regex videos on Youtube by PyMoondra.
A simpler regex solution:
import re
pattern = re.compile(r'\[\d+\]$')
s = '10.9.8[35]'
r = pattern.sub('', s)
print(r) # 10.9.8
The pattern matches square brackets at the end of a string with one or more digits inside. The sub call then replaces the brackets and the number with an empty string.
If you wanted to keep the number from the square brackets, just change the pattern and the replacement, for example:
import re
pattern = re.compile(r'\[(\d+)\]$')
s = '10.9.8[35]'
r = pattern.sub(r'.\1', s)
print(r) # 10.9.8.35
Alternatively, as the other answer says, you can just find the bracket and slice the string to get rid of it.

Cannot extract all words using word or whitespace boundary with regex

I need to extract Male-Cat twice:
a = "Male-Cat Male-Cat Male-Cat-Female"
b = re.findall(r'(?:\s|^)Male-Cat(?:\s|$)', a)
print (b)
['Male-Cat ']
c = re.findall(r'\bMale-Cat\b', a)
print (c)
['Male-Cat', 'Male-Cat', 'Male-Cat']
I need to extract Male-Cat three times:
a = "Male-Cat Male-Cat Male-Cat"
b = re.findall(r'(?:\s|^)Male-Cat(?:\s|$)', a)
print (b)
['Male-Cat ', ' Male-Cat']
c = re.findall(r'\bMale-Cat\b', a)
print (c)
['Male-Cat', 'Male-Cat', 'Male-Cat']
Other strings, which the first approach parses correctly:
a = 'Male-Cat Female-Cat Male-Cat-Female Male-Cat'
a = 'Male-Cat-Female'
a = 'Male-Cat'
Am I missing something? Can you explain what is wrong and what the correct way is?
Use lookarounds to extract words inside whitespace boundaries:
r'(?<!\S)Male-Cat(?!\S)'
See the online regex demo
Details
(?<!\S) - a whitespace or start of string must appear immediately to the left of the current location
Male-Cat - the term to search for
(?!\S) - a whitespace or end of string must appear immediately to the right of the current location
Since (?<!\S) and (?!\S) are zero-width assertions, the whitespace won't be consumed, and consecutive matches will get found.
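Applied to the strings from the question, the lookaround pattern gives the expected counts. The original (?:\s|^)...(?:\s|$) attempt misses matches because each match consumes the space that the next match would need as its left boundary.

```python
import re

pattern = re.compile(r'(?<!\S)Male-Cat(?!\S)')

two = pattern.findall("Male-Cat Male-Cat Male-Cat-Female")
three = pattern.findall("Male-Cat Male-Cat Male-Cat")
print(two)    # ['Male-Cat', 'Male-Cat']
print(three)  # ['Male-Cat', 'Male-Cat', 'Male-Cat']
```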

how to output uppercases with regex using python

I have a string like following:
element = ['ABCa4.daf<<tag1>>permission : wiadsfth.accedsafsds.INTERNET<<tag2>>',]
I am trying to use re.findall to output only the uppercase letters at the end of the string (before <<tag2>>).
Here is what I did:
re.findall('<<tag1>>' +"(.*?)"+ '<<tag2>>' , element)
But the result also includes the other letters before 'INTERNET'. Given that the letters before INTERNET change all the time, I can't use them as a tag either.
Can anybody shed some light? Thank you so much!
You need to allow any symbols before the [A-Z]+:
>>> import re
>>> s = 'ABCa4.daf<<tag1>>permission : wiadsfth.accedsafsds.INTERNET<<tag2>>'
>>> re.findall('<<tag1>>.*?([A-Z]+)<<tag2>>', s)
['INTERNET']
.*? is a non-greedy match for any character. [A-Z]+ matches 1 or more upper case letters.
Just match any sequence of uppercase letters followed by <<tag2>>:
re.findall(r'[A-Z]+(?=<<tag2>>)', element[0])
or
re.findall(r'[A-Z]+(?=[^<>]*<<tag2>>)', element[0])
to handle stuff like INTERNET foobar <<tag2>>.
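A quick check of the two lookahead patterns. The second sample string is made up for illustration and is not from the question:

```python
import re

s1 = 'ABCa4.daf<<tag1>>permission : wiadsfth.accedsafsds.INTERNET<<tag2>>'
s2 = 'ABCa4.daf<<tag1>>permission : INTERNET foobar <<tag2>>'  # hypothetical input

# Uppercase run immediately followed by <<tag2>>.
direct = re.findall(r'[A-Z]+(?=<<tag2>>)', s1)
# Uppercase run followed by any non-bracket text, then <<tag2>>.
loose = re.findall(r'[A-Z]+(?=[^<>]*<<tag2>>)', s2)

print(direct)  # ['INTERNET']
print(loose)   # ['INTERNET']
```

The leading ABC is skipped in both cases because <<tag1>> sits between it and <<tag2>>, and [^<>]* cannot cross an angle bracket.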
Finally, to match any sequence of A-Z at any position between start and end tags, you're going to need this little monster:
rr = r"""(?x)
    [A-Z]+
    (?=
        (?:
            (?! <<tag1>>) .
        ) *
        <<tag2>>
    )
"""
element = ['ABC xyz DEF <<tag1>> permission : INTERNET foo XYZ bar <<tag2>>',]
print(re.findall(rr, element[0]))  # ['INTERNET', 'XYZ']

Do not match word boundary between parentheses with python regex

I currently have:
regex = r'\bon the\b'
but I need my regex to match only if this keyword (currently "on the") is not between parentheses in the text:
should match:
john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
should not match:
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)
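I don't think that regex would help you here for the general case.
For your examples, this regex comes close to what you want (it still matches "(my son is )on the beach", because the lookbehind only inspects the character four positions before the keyword):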
(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])
description:
(?<=[^\(\)].{3}) Positive Lookbehind - Assert that the regex below can be matched
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
on the matches the characters on the literally (case sensitive)
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
(?=.{3}[^\(\)]) Positive Lookahead - Assert that the regex below can be matched
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
If you want to generalize the problem to any string between the parentheses and the string you are searching for, this regex will not work.
The issue is the unknown length of the text between the parentheses and your string. In Python's re module a lookbehind must have a fixed width, so unbounded quantifiers are not allowed inside it.
In my regex I used a positive lookahead and a positive lookbehind. The same result could be achieved with negative ones, but the issue remains.
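To see the lookbehind restriction concretely, here is a small sketch: Python's re module refuses to compile a lookbehind that contains an unbounded quantifier. The pattern itself is only an illustration of what one might like to write.

```python
import re

try:
    # A lookbehind of unknown length: "(" followed by any number of non-")" characters.
    re.compile(r'(?<=\([^)]*)on the')
    compiled = True
except re.error as exc:
    compiled = False
    print("rejected:", exc)
```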
Suggestion: write a small piece of Python code that checks whether a whole line contains your text outside parentheses, as regex alone can't do the job.
example:
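import re
mystr = 'on the'
# sample input (assumed here); in practice mylist holds your own lines
mylist = ['john is on the beach', 'the spon (is on the table)', '(my son is )on the beach']
data = '\n'.join(mylist)
# here you put the un-wanted string series, which is easy to define with regex
unWanted = re.findall(r'\(.*' + re.escape(mystr) + r'.*\)|\)' + re.escape(mystr), data)
# delete un-wanted strings (build a new list instead of removing while iterating)
mylist = [line for line in mylist if not any(item in line for item in unWanted)]
# look for what you want
for line in mylist:
    if mystr in line:
        print(line)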
where:
mylist: a list containing all the lines you want to search through.
mystr: the string you want to find.
Hope this helped.
On UNIX, the grep utility with the following regular expressions will be sufficient (in grep's basic syntax, parentheses are literal when left unescaped):
grep " on the " input_file_name | grep -v "(.* on the .*)"
How about something like this: ^(.*)(?:\(.*\))(.*)$ see it in action.
As you requested, it "matches only words that are not between parentheses in the text"
So, from:
some text (more text in parentheses) and some not in parentheses
Matches: some text + and some not in parentheses
More examples at the link above.
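In Python, the two capture groups hold the text before and after the parenthesized part. A minimal sketch using the sample sentence above:

```python
import re

text = "some text (more text in parentheses) and some not in parentheses"

# Group 1 is everything before "(", group 2 everything after ")".
m = re.match(r'^(.*)(?:\(.*\))(.*)$', text)
outside = m.groups()
print(outside)  # ('some text ', ' and some not in parentheses')
```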
EDIT: changing answer since the question was changed.
To capture all mentions not within parentheses I'd use some code instead of a huge regex.
Something like this will get you close:
import re
pattern = r"(on the)"
test_text = '''john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)'''
match_list = test_text.split('\n')
for line in match_list:
    print(line, "->", end=" ")
    bracket_pattern = r"(\(.*\))"  # remove everything between ()
    brackets = re.findall(bracket_pattern, line)
    for match in brackets:
        line = line.replace(match, "")
    matches = re.findall(pattern, line)
    for match in matches:
        print(match, end="")
    print()
Output:
john is on the beach -> on the
let me put this on the fridge -> on the
he (my son) is on the beach -> on the
arnold is on the road (to home) -> on the
(my son is )on the beach -> on the (this is the only one that doesn't work)
john is at the beach ->
bob is at the pool (berkeley) ->
the spon (is on the table) ->
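One possible way to also reject the failing line, sketched under the assumption that text directly attached to a closing parenthesis should not count: replace each parenthesized group with a word character such as _ instead of deleting it, then search with the \b pattern from the question. The word boundary then fails right after the placeholder. find_outside_parens is a hypothetical helper name, and nested parentheses are not handled.

```python
import re

KEYWORD = re.compile(r'\bon the\b')
PARENS = re.compile(r'\([^()]*\)')  # one non-nested (...) group

def find_outside_parens(line):
    """Return True if the keyword occurs outside parentheses."""
    # "_" is a word character, so \b cannot match between it and "on".
    masked = PARENS.sub('_', line)
    return KEYWORD.search(masked) is not None

lines = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]
for line in lines:
    print(line, "->", find_outside_parens(line))
```

With these inputs the first four lines report True and the last four report False, which lines up with the "should match" and "should not match" lists in the question.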
