REGEX_String between strings in a list - python

From this list:
['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
I would like to reduce it to this list:
['BELMONT PARK', 'EAGLE FARM']
You can see from the first list that the desired words are between '\n' and '('.
My attempted solution is:
for i in x:
    result = re.search('\n(.*)(', i)
    print(result.group(1))
This returns the error 'unterminated subpattern'.
Thank you

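You’re getting an error because the ( is unescaped. Even with it escaped, the pattern will not work on text that contains a literal backslash followed by n rather than a real newline: .* is greedy, so you’ll get the following matches: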
\nBELMONT PARK (
\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (
You can try the following:
(?<=\\n)(?!.*\\n)(.*)(?= \()
(?<=\\n): Positive lookbehind to ensure \n is before match
(?!.*\\n): Negative lookahead to ensure no further \n is included
(.*): Your match
(?= \(): Positive lookahead to ensure ( is after match
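As a minimal sketch of the lookaround approach in Python, assuming the list items contain real newline characters (so the pattern uses \n rather than \\n); the variable names are illustrative only:

```python
import re

x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']

# Assumption: the strings hold real newlines, so \n is written with one backslash.
# (?<=\n)   a newline must come right before the match
# (?!.*\n)  no further newline may follow on the rest of the string
# (?= \()   a space and "(" must come right after the match
pattern = re.compile(r"(?<=\n)(?!.*\n)(.*)(?= \()")

results = []
for item in x:
    m = pattern.search(item)
    if m:
        results.append(m.group(1))

print(results)
```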

You can get the matches without using any lookarounds, as you are already using a capture group.
\n(.*) \(
Explanation
\n Match a newline
(.*) Capture group 1, match any character except a newline, as much as possible
\( Match a space and (
Example
import re
x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
pattern = r"\n(.*) \("
for i in x:
    m = re.search(pattern, i)
    if m:
        print(m.group(1))
Output
BELMONT PARK
EAGLE FARM
If you want to return a list:
x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
pattern = r"\n(.*) \("
res = [m.group(1) for i in x for m in [re.search(pattern, i)] if m]
print(res)
Output
['BELMONT PARK', 'EAGLE FARM']

Related

How to replace characters in a text by space except for list of words in python

I want to replace all characters in a text by spaces, but I want to leave a list of words.
For instance:
text = "John Thomas bought 300 shares of Acme Corp. in 2006."
list_of_words = ['Acme Corp.', 'John Thomas']
My wanted output would be:
output_text = "***********                      **********         "
I would like to change unwanted characters to spaces before I do the * replacement:
"John Thomas                      Acme Corp.         "
Right now I know how to replace only the list of words, but cannot come up with the spaces part.
rep = {key: len(key)*'*' for key in list_of_words}
rep = dict((re.escape(k), v) for k, v in rep.items())
pattern = re.compile("|".join(rep.keys()))
pattern.sub(lambda m: rep[re.escape(m.group(0))], text)
You may build a pattern like
(?s)word1|word2|wordN|(.)
When Group 1 matches, replace with a space, else, replace with the same amount of asterisks as the match text length:
import re
text = "John Thomas bought 300 shares of Acme Corp. in 2006."
list_of_words = ['Acme Corp.', 'John Thomas']
pat = "|".join(sorted(map(re.escape, list_of_words), key=len, reverse=True))
pattern = re.compile(f'{pat}|(.)', re.S)
print(pattern.sub(lambda m: " " if m.group(1) else len(m.group(0))*"*", text))
=> '***********                      **********         '
Details
sorted(map(re.escape, list_of_words), key=len, reverse=True) - escapes words in list_of_words and sorts the list by length in descending order (it will be necessary if there are multiword items)
"|".join(...) - build the alternatives out of list_of_words items
lambda m: " " if m.group(1) else len(m.group(0))*"*" - if Group 1 matches, replace with a space, else with the asterisks of the same length as the match length.
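To illustrate why the descending length sort matters, here is a small sketch in which one keep-word is a prefix of another. The helper name mask_except and the sample text are hypothetical and not part of the original answer:

```python
import re

def mask_except(text, words):
    # Hypothetical helper: words to keep become asterisks, everything else a space.
    # Longest alternatives go first so "Acme Corp." is tried before "Acme".
    alternatives = sorted(map(re.escape, words), key=len, reverse=True)
    pattern = re.compile("|".join(alternatives) + "|(.)", re.S)
    return pattern.sub(
        lambda m: " " if m.group(1) else "*" * len(m.group(0)),
        text,
    )

masked = mask_except("Acme Corp. rose", ["Acme", "Acme Corp."])
print(repr(masked))
```

Without the sort, the shorter alternative "Acme" would be tried first here, and " Corp." would be blanked out instead of masked.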

Regex for Matching Apostrophe 's' words

I'm trying to create a regex to match a word that may or may not have an apostrophe before the 's' at the end. In the example below, I'd like to replace the apostrophe with a regex that matches either an apostrophe 's' or just an 's'.
Philip K Dick's Electric Dreams
Philip K Dicks Electric Dreams
What I am trying so far is below, but I'm not getting it to match correctly. Any help here is great. Thanks!
Philip K Dick[\'[a-z]|[a-z]] Electric Dreams
Just set the apostrophe as optional in the regex pattern.
Like this: [a-zA-Z]+\'?s
For example, using your test strings:
import re
s1 = "Philip K Dick's Electric Dreams"
s2 = "Philip K Dicks Electric Dreams"
>>> re.findall("[a-zA-Z]+\'?s", s1)
["Dick's", 'Dreams']
>>> re.findall("[a-zA-Z]+\'?s", s2)
['Dicks', 'Dreams']
You can use the regex (\w+)'s to represent any letters followed by 's. Then you can substitute back that word followed by just s.
>>> s = "Philip K Dick's Electric Dreams"
>>> re.sub(r"(\w+)'s", r'\1s', s)
'Philip K Dicks Electric Dreams'
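Applied to the full title from the question, the optional apostrophe could be sketched as follows. The variable names are illustrative, and re.fullmatch is used on the assumption that the whole title should match:

```python
import re

# '? makes the apostrophe optional, so both spellings are accepted.
title_pattern = re.compile(r"Philip K Dick'?s Electric Dreams")

titles = [
    "Philip K Dick's Electric Dreams",
    "Philip K Dicks Electric Dreams",
]

matches = [bool(title_pattern.fullmatch(t)) for t in titles]
print(matches)
```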

I want to match money amount with regex for indian currency without commas

I want to match amount like Rs. 2000 , Rs.2000 , Rs 20,000.00 ,20,000 INR 200.25 INR.
Output should be
2000,2000,20000.00,20000,200.25
The regular expression I have tried is this:
(?:(?:(?:rs)|(?:inr))(?:!-{0,}|\.{1}|\ {0,}|\.{1}\ {0,}))(-?[\d,]+ (?:\.\d+)?)(?:[^/^-^X^x])|(?:(-?[\d,]+(?:\.\d+)?)(?:(?:\ {0,}rs)|(?:\ {0,}rs)|(?:\ {0,}(inr))))
But it is not matching numbers with inr or rs after the amount
I want to match it using re library in Python.
I suggest using alternation group with capture groups inside to only match the numbers before or after your constant string values:
(?:Rs\.?|INR)\s*(\d+(?:[.,]\d+)*)|(\d+(?:[.,]\d+)*)\s*(?:Rs\.?|INR)
Pattern explanation:
(?:Rs\.?|INR)\s*(\d+(?:[.,]\d+)*) - Branch 1:
(?:Rs\.?|INR) - matches Rs, Rs., or INR...
\s* - followed with 0+ whitespaces
(\d+(?:[.,]\d+)*) - Group 1: one or more digits followed with 0+ sequences of a comma or a dot followed with 1+ digits
| - or
(\d+(?:[.,]\d+)*)\s*(?:Rs\.?|INR) - Branch 2:
(\d+(?:[.,]\d+)*) - Group 2 capturing the same number as in Branch 1
\s* - zero or more whitespaces
(?:Rs\.?|INR) - followed with Rs, Rs. or INR.
Sample code:
import re
p = re.compile(r'(?:Rs\.?|INR)\s*(\d+(?:[.,]\d+)*)|(\d+(?:[.,]\d+)*)\s*(?:Rs\.?|INR)')
s = "Rs. 2000 , Rs.3000 , Rs 40,000.00 ,50,000 INR 600.25 INR"
print([x if x else y for x,y in p.findall(s)])
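The question asks for the amounts without commas, which the pattern alone does not do. A minimal sketch, assuming the same two-branch pattern and the test string from the question, strips the commas after matching:

```python
import re

p = re.compile(r'(?:Rs\.?|INR)\s*(\d+(?:[.,]\d+)*)|(\d+(?:[.,]\d+)*)\s*(?:Rs\.?|INR)')
s = "Rs. 2000 , Rs.2000 , Rs 20,000.00 ,20,000 INR 200.25 INR."

# findall returns one (group1, group2) tuple per match; exactly one of them is non-empty.
# The thousands separators are removed after matching.
amounts = [(before or after).replace(",", "") for before, after in p.findall(s)]
print(",".join(amounts))
```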
Alternatively, if you can use PyPi regex module, you may leverage branch reset construct (?|...|...) where capture group IDs are reset within each branch:
>>> import regex as re
>>> rx = re.compile(r'(?|(?:Rs\.?|INR)\s*(\d+(?:[.,]\d+)*)|(\d+(?:[.,]\d+)*)\s*(?:Rs\.?|INR))')
>>> teststring = "Rs. 2000 , Rs.2000 , Rs 20,000.00 ,20,000 INR 200.25 INR."
>>> prices = [match.group(1) for match in rx.finditer(teststring)]
>>> print(prices)
['2000', '2000', '20,000.00', '20,000', '200.25']
You can access the capture group in each branch by ID=1 (see match.group(1)).
Though slightly out of scope, here's an experiment with the newer regex module by Matthew Barnett, which supports subroutines and branch resets:
import regex as re
rx = re.compile(r"""
    (?(DEFINE)
        (?<amount>\d[\d.,]+)    # amount, starting with a digit
        (?<currency1>Rs\.?\ ?)  # Rs, Rs. or Rs with space
        (?<currency2>INR)       # just INR
    )
    (?|
        (?&currency1)
        (?P<money>(?&amount))
    |
        (?P<money>(?&amount))
        (?=\ (?&currency2))
    )
    """, re.VERBOSE)
teststring = "Rs. 2000 , Rs.2000 , Rs 20,000.00 ,20,000 INR 200.25 INR."
prices = [m.group('money') for m in rx.finditer(teststring)]
print(prices)
# ['2000', '2000', '20,000.00', '20,000', '200.25']
This uses subroutines and a branch reset (thanks to @Wiktor!).
And another:
(([\d+\,]+)(\.\d+)?\s\w{3}|(\w+\.?)\s?[\d+\,]+(\.?\d+))

Python - Regex did not match string

I am trying to get last regex match on a message broadcast by socket, but it returns blank.
>>> msg = ':morgan.freenode.net 353 MechaBot = #xshellz :MechaBot ITechGeek zubuntu whitesn JarodRo SpeedFuse st3v0 anyx danielhyuuga1 AussieKid92 JeDa Failed Guest83885 RiXtEr xryz D-Boy warsoul buggiz rawwBNC MagixZ fedai Sunborn oatgarum dune SamUt Pythonista_ +xinfo madmattco BuGy azuan DarianC stupidpioneers AnTi_MTtr JeDaYoshi|Away PaoLo- StephenS chriscollins Rashk0 morbid1 Lord255 victorix [DS]Matej EvilSoul `|` united Scrawn avira ssnova munsterman Logxen niko gorut Jactive|OFF grauwulf b0lt saapete'
>>> r = re.compile(r"(?P<host>.*?) (?P<code>.*?) (?P<name>.*?) = (?P<msg>.*?)", re.IGNORECASE)
>>> r.search(msg).groups()
(':morgan.freenode.net', '353', 'MechaBot', '')
(?P<host>.*?) (?P<code>.*?) (?P<name>.*?) = (?P<msg>.*)
Try this. It works. See the demo. Your code uses .*?, which matches as few characters as possible. In the earlier groups, .*? is followed by a space, so it matches up to the first space it encounters. In the last group you have not specified anything after it, so in lazy mode it matches nothing.
https://regex101.com/r/aQ3zJ3/1
You can also use
(?P<host>.*?) (?P<code>.*?) (?P<name>.*?) = (?P<msg>.*?)$
which says to match lazily up to the end of the string.
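The difference between the three variants can be sketched on a shortened version of the message (shortened here only for readability; the pattern names are illustrative):

```python
import re

msg = ":morgan.freenode.net 353 MechaBot = #xshellz :MechaBot ITechGeek zubuntu"

prefix = r"(?P<host>.*?) (?P<code>.*?) (?P<name>.*?) = "
lazy = re.compile(prefix + r"(?P<msg>.*?)")        # nothing after the lazy group
greedy = re.compile(prefix + r"(?P<msg>.*)")       # greedy: runs to the end
anchored = re.compile(prefix + r"(?P<msg>.*?)$")   # lazy, but forced to reach the end

lazy_msg = lazy.search(msg).group("msg")
greedy_msg = greedy.search(msg).group("msg")
anchored_msg = anchored.search(msg).group("msg")

print(repr(lazy_msg))
print(repr(greedy_msg))
print(repr(anchored_msg))
```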

Python regex findall

I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Or simply, I want to extract every piece of text inside the [p][/p] tags.
Here is my attempt:
regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']
What is the correct regex to get: ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
or ['Barack Obama', 'Bill Gates']?
import re
regex = r"\[P\] (.+?) \[/P\]"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)
yields
['Barack Obama', 'Bill Gates']
The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same unicode as u'[[1P].+?[/P]]+?' except harder to read.
The first bracketed group [[1P] tells re that any of the characters in the list ['[', '1', 'P'] should match, and similarly with the second bracketed group [/P]]. That's not what you want at all. So,
Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
To return only the words inside the tags, place grouping parentheses around .+?.
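The three fixes above can be sketched side by side, producing both outputs the question asks for. The variable names are illustrative, and \s* is used here as an assumption so that padding inside the tags is dropped:

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# Escaped brackets, no capture group: findall returns the whole tagged span.
with_tags = re.findall(r"\[P\].+?\[/P\]", line)

# Capture group around .+? returns only the text between the tags.
names = re.findall(r"\[P\]\s*(.+?)\s*\[/P\]", line)

print(with_tags)
print(names)
```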
Try this:
for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    print(match.group(1))  # text between the tags
Your question is not 100% clear, but I'm assuming you want to find every piece of text inside [P][/P] tags:
>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall(r'\[P\]\s?(.+?)\s?\[\/P\]', line)
['Barack Obama', 'Bill Gates']
You can replace your pattern with
regex = r"\[P\]([\w\s]+)\[\/P\]"
Use this pattern,
pattern = r'\[P\].+?\[\/P\]'
