I have a list from which I want to extract a part of text from those elements which have the following pattern:
<Start of string><Less than 30 characters> advocate. versus
I only want the <Start of string><Less than 30 characters> part
The code which I think should have worked but didn't:
a = re.search('^.{,30}advocate. versus', text).group(1)
and
a = re.search('^(.{,30})advocate. versus', text).group(1)
Apart from these, I also tried
a = re.search('^(.*)advocate. versus', text).group(1)
which worked, but I only want less than 30 characters, not just any number of characters.
Examples:
Consider the list with two items:
['Mr. Rajesh Bhardwaj, Advocate ..... Appellant Through Ms. Prem Lata Bansal, Sr. Standing Counsel with Mr.Vishnu Sharma, Advocate. versus PRADEEP KUMAR SAHNI ..... Respondent Through None', 'Mr.Vishnu Sharma, Advocate. versus JYOTI APPARELS']
I want to extract the text from second element which has less than 30 characters before "advocate. versus" but not text from the first one which has more than 30 characters. Basically, I want this from the second item:
Mr.Vishnu Sharma,
Ignore the case of the text in the list, assume everything is in lowercase.
Any help would be really appreciated.
This is what you are looking for. Write the quantifier with an explicit zero, {0,30}. As I understand it, you don't want to capture the "advocate. versus" part, so you can use a lookahead for that: it checks that the text is there but does not include it in the match. Don't use ^ at the start, because it means "start of the line" and your match is not at the start of the line. Also keep in mind that regexes are case sensitive: "advocate" and "Advocate" are two different patterns. The regex below matches regardless of the case of the first letter.
As I understand it, the text you want has a comma before it, so we can use that to extract exactly the value you want: basically everything after the comma and before "advocate. versus".
(?<=,)[^,]{0,30},(?= [Aa]dvocate\. versus)
demo
https://regex101.com/r/cO5wcg/3
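For the two-item list in the question, a minimal Python sketch could look like the following. It assumes "less than 30 characters" means at most 29 characters counted from the start of each string, and it uses re.IGNORECASE rather than spelling out both cases; the names items and prefixes are only illustrative.

```python
import re

items = [
    'Mr. Rajesh Bhardwaj, Advocate ..... Appellant Through Ms. Prem Lata Bansal, '
    'Sr. Standing Counsel with Mr.Vishnu Sharma, Advocate. versus PRADEEP KUMAR SAHNI '
    '..... Respondent Through None',
    'Mr.Vishnu Sharma, Advocate. versus JYOTI APPARELS',
]

# Group 1: at most 29 characters from the start of the string (lazy),
# followed by optional whitespace and the literal "advocate. versus".
pattern = re.compile(r'^(.{0,29}?)\s*advocate\. versus', re.IGNORECASE)

prefixes = []
for item in items:
    m = pattern.search(item)
    if m:  # items with a longer prefix simply do not match
        prefixes.append(m.group(1))

print(prefixes)
```

Only the second item contributes a prefix; in the first one, "Advocate. versus" starts well past the 29-character limit, so the search fails.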
I was trying to parse together a script for a movie into a dataset containing two columns 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem but parsing the script is the problem.
The schema of the script goes like this:
Now, this, if copied and pasted into a .txt file is something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue and not what is happening in actions like the last line: "slides his journal back". The dialogues often exceed more than two lines so please do not provide hard-coded solutions for 2 lines only. I think I am thinking about this problem in just one direction, some other method to filter can also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, I am getting no specific endpoints to end the search at. It would be a great help if the answer you give has 2 groups, one with name, the other with the dialogue. Like in the case of colon-separated, my regex was:
pattern = r'(^[a-zA-Z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.
Use https://www.onlineocr.net/ to convert the PDF to text.
It shows the outcome immediately, with the names on the same line as the dialogue,
which could allow for simple processing:
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
Not sure whether it will work for longer dialogues.
Another solution is to extract the data from the text file that you can download by clicking the "download output file" link. That file is formatted differently. In that file,
10 leading spaces indicate the name and 5 leading spaces the dialogue, at least for your sample screenshot.
The regex is
r" {10}(.+)(\n( {5}[^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
It puts whatever starts with ten spaces into group 1 and whatever starts with exactly five spaces into group 2.
The subexpression " {5}[^ ].+\n" denotes a line where the first five symbols are spaces, the sixth symbol is anything but a space, and the remaining symbols up to the end of the line are arbitrary. Since dialogue tends to be multiline, that expression is followed by another plus.
You will have to delete the extra whitespace from the dialogue with additional code and/or regex.
If the number of spaces varies a bit (say 4-6 and 7-14 respectively) but the two ranges stay distinct, the regex needs to be adjusted with the variable repetition operator (curly braces such as {4,6}, with no space inside them) or with optional spaces ( ?).
r" {7,14}(.+)(\n( {4,6}[^ ].+\n)+)"
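As a rough illustration of the indentation idea, here is a small Python sketch. The sample text is made up to follow the assumed layout (names indented by 10 spaces, dialogue by 5, action lines not indented), and the names sample and rows are hypothetical.

```python
import re

# Hypothetical sample in the layout assumed above: names indented by
# 10 spaces, dialogue by 5 spaces, action lines not indented at all.
sample = (
    " " * 10 + "ARTHUR\n"
    + " " * 5 + "No. I just,-- some of it's\n"
    + " " * 5 + "personal. You know?\n"
    + "He leans in.\n"
    + " " * 10 + "SOCIAL WORKER\n"
    + " " * 5 + "I understand.\n"
)

# Group 1: the name line. Group 2: one or more dialogue lines.
pattern = re.compile(r"^ {10}(\S.*)\n((?: {5}\S.*\n)+)", re.MULTILINE)

rows = []
for name, speech in pattern.findall(sample):
    # Join the wrapped dialogue lines and drop the indentation.
    dialogue = " ".join(line.strip() for line in speech.splitlines())
    rows.append((name.strip(), dialogue))

print(rows)
```

The action line "He leans in." is skipped because it has neither the 10-space nor the 5-space indentation.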
The last idea is to use a preexisting list of the names in the play to match them, e.g. (SOCIAL WORKER|JOHN|MARY|ARTHUR). The https://www.onlineocr.net/ website could still be used to help spot and delete actions.
In Python, you can use DOTALL:
re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (This means the regex won't recognize speaker names with fewer than three characters. I did this to avoid false positives in special cases, but you might want to change it. Also check what other characters might occur in speaker names: dots, hyphens, lower case letters...)
(?=\n) is a lookahead ensuring the speaker name is directly followed by a new line (avoids a false positive if a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead, i.e. a non-capturing expression, ensuring the following part is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.
When working on this string:
see.Ya23.v2.0023.jpg
I already found out I could get the last occurence of a number by using:
(?P<Frame>\d+(?!.*\d))
It gives me the group containing "0023".
But how do I group everything until that happens?
If I do this:
(?P<Sequence>.*)(?P<Frame>\d+(?!.*\d))
My two groups contain "see.Ya23.v2.002" and "3", when I would like to have to have them contain "see.Ya23.v2." and "0023".
Hope you can help me. Thanks in advance.
You almost had it.
In the first group, just add the lazy indicator ? after the quantifier. That makes the group stop at the first possible position where the rest of the pattern can match.
(?P<Sequence>.*?)(?P<Frame>\d+(?!.*\d))
this will give you
see.Ya23.v2. and 0023
and if you also want to avoid selecting the dot
(?P<Sequence>.*?)\.(?P<Frame>\d+(?!.*\d))
the result is see.Ya23.v2 and 0023
The simplest and quickest way is to put a negative lookbehind assertion for a digit
before your digit expression, at the start of the Frame group.
This makes sure the Frame is the last complete run of digits and
still allows a greedy Sequence match, which gives a performance boost.
(?P<Sequence>.*)(?P<Frame>(?<!\d)\d+(?!.*\d))
https://regex101.com/r/LCUoCR/1
The problem is explained in my Youtube video related to how backtracking works in regex.
In short: the .* part matches the whole string first, and then the regex engine starts stepping back through the string to accommodate a part for the subsequent patterns, i.e. for \d+(?!.*\d). Once the 3 is found in see.Ya23.v2.0023.jpg, this pattern matches, and the regex engine returns a match.
All you need is to make sure the char before the \d+ is a non-digit char and you need to use
(?P<Sequence>(?:.*\D)?)(?P<Frame>\d+)(?!.*\d)
See the regex demo.
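To compare the variants discussed in these answers side by side, a short Python sketch (the variable names are only illustrative) could be:

```python
import re

name = "see.Ya23.v2.0023.jpg"

# Greedy Sequence: .* grabs as much as it can and leaves Frame one digit.
greedy = re.search(r"(?P<Sequence>.*)(?P<Frame>\d+(?!.*\d))", name)

# Lazy Sequence: .*? stops as early as the rest of the pattern allows.
lazy = re.search(r"(?P<Sequence>.*?)(?P<Frame>\d+(?!.*\d))", name)

# Greedy Sequence, but Frame may not start in the middle of a digit run.
guarded = re.search(r"(?P<Sequence>.*)(?P<Frame>(?<!\d)\d+(?!.*\d))", name)

for label, m in (("greedy", greedy), ("lazy", lazy), ("guarded", guarded)):
    print(label, m.group("Sequence"), m.group("Frame"))
```

The greedy version reproduces the split from the question ("see.Ya23.v2.002" and "3"), while the lazy and guarded versions both yield "see.Ya23.v2." and "0023".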
I've been sitting on this problem for several hours now and I really don't know anymore...
Essentially, I have an A|B|C - type separated regex and for whatever reason C matches over B, even though the individual regexes should be tested from left-to-right and stopped in a non-greedy fashion (i.e. once a match is found, the other regex' are not tested anymore).
This is my code:
import re
text = 'Patients with end stage heart failure fall into stage D of the ABCD classification of the American College of Cardiology (ACC)/American Heart Association (AHA), and class III–IV of the New York Heart Association (NYHA) functional classification; they are characterised by advanced structural heart disease and pronounced symptoms of heart failure at rest or upon minimal physical exertion, despite maximal medical treatment according to current guidelines.'
expansion = "American Heart Association"
re_exp = re.compile(expansion + "|" + r"(?<=\W)" + expansion + "|"\
+ expansion.split()[0] + r"[-\s].*?\s*?" + expansion.split()[-1])
m = re_exp.search(text)
print(m.group(0))
I want the regex to find the "expansion" string. In my dataset, the text sometimes has the expansion string slightly edited, for example with articles or prepositions like "for" or "the" between the main nouns. This is why I first try to match the string as is, then try to match it after any non-word character (i.e. parentheses or, like in the example above, a whole lot of stuff because the space was omitted), and finally go full wildcard and look for the beginning and the end of the string with wildcards in between.
Either way, with the example above I would expect to get the following output:
American Heart Association
but what I'm getting is
American College of Cardiology (ACC)/American Heart Association
which is the match for the final regex.
If I delete the final regex or just call re.findall(r"(?<=\W)"+ expansion, text), I get the output I want, meaning the regex is in fact matching properly.
What gives?
The regex looks like this:
American Heart Association|(?<=\W)American Heart Association|American[-\s].*?\s*?Association
The first 2 alternatives match the same text, only the second one has a positive lookbehind prepended.
You can omit the second alternative: wherever it could match, the first alternative (the same text without the assertion) has already matched, and wherever the first one fails, the second one fails too.
As the pattern matches from left to right and encounters the first occurrence with American, the first and the second alternatives can not match American College of Cardiology.
Then the third alternation can match it, and due to the .*? it can stretch until the first occurrence of Association.
What you might do is for example exclude possible characters to match using a negated character class:
\bAmerican\b[^/,.]*\bAssociation\b
Regex demo
Or you might use a tempered greedy token approach to not allow specific words between the first and last part:
\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b
Regex demo
So re.findall(r"(?<=\W)"+ expansion, text) works because the character before the match, "/", is a non-word character (denoted \W). Your full regex, however, will also match "American [whatever random stuff here] Association". This means you match "American College of Cardiology (ACC)/American Heart Association" before you get to the inner string "American Heart Association". E.g. if you deleted the first "American" in your string, you would get the match you are looking for with your regex.
You need to be more restrictive with your regex to rule out situations like these.
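A small Python sketch of the effect described in these answers, using a shortened version of the text from the question (the variable names are only illustrative):

```python
import re

text = ("classification of the American College of Cardiology (ACC)/"
        "American Heart Association (AHA), and class III-IV")

# The original three-way alternation: the engine tries every alternative
# at the leftmost "American" before it ever moves further right.
original = re.compile(
    r"American Heart Association"
    r"|(?<=\W)American Heart Association"
    r"|American[-\s].*?\s*?Association"
)

# Restricted wildcard: the filler may not contain "/", "," or ".".
restricted = re.compile(r"\bAmerican\b[^/,.]*\bAssociation\b")

original_match = original.search(text).group(0)
restricted_match = restricted.search(text).group(0)

print(original_match)
print(restricted_match)
```

The alternation order only decides which alternative wins at a given starting position; an earlier starting position always beats a later one, which is why the third alternative wins here.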
first time posting, I've lurked for a little while, really excited about the helpful community here.
So, working with "Automate the boring stuff" by Al Sweigart
Doing an exercise that requires I build a regex that finds numbers in standard number format. Three digit, comma, three digits, comma, etc...
So hopefully will match 1,234 and 23,322 and 1,234,567 and 12 but not 1,23,1 or ,,1111, or anything else silly.
I have the following.
import re
testStr = '1,234,343'
matches = []
numComma = re.compile(r'^(\d{1,3})*(,\d{3})*$')
for group in numComma.findall(str(testStr)):
    Num = group
    print(str(Num) + '-') #Printing here to test each loop
    matches.append(str(Num[0]))
#if len(matches) > 0:
#    print(''.join(matches))
Which outputs this....
('1', ',343')-
I'm not sure why the middle ",234" is being skipped over. Something wrong with the regex, I'm sure. Just can't seem to wrap my head around this one.
Any help or explanation would be appreciated.
FOLLOW UP EDIT. So after following all your advice that I could assimilate, I got it to work perfectly for several inputs.
import re
testStr = '1,234,343'
numComma = re.compile(r'^(?:\d{1,3})(?:,\d{3})*$')
Num = numComma.findall(testStr)
print(Num)
gives me....
['1,234,343']
Great! BUT! What about when I change the string input to something like
'1,234,343 and 12,345'
Same code returns....
[]
Grrr... lol, this is fun, I must admit.
So the purpose of the exercise is to be able to eventually scan a block of text and pick out all the numbers in this format. Any insight? I thought this would add an additional tuple, not return an empty one...
FOLLOW UP EDIT:
So, a day later(Been busy with 3 daughters and Honey-do lists), I've finally been able to sit down and examine all the help I've received. Here's what I've come up with, and it appears to work flawlessly. Included comments for my own personal understanding. Thanks again for everything, Blckknght, Saleem, mhawke, and BHustus.
My final code:
import re
testStr = '12,454 So hopefully will match 1,234 and 23,322 and 1,234,567 and 12 but not 1,23,1 or ,,1111, or anything else silly.'
numComma = re.compile(r'''
(?:(?<=^)|(?<=\s)) # Looks behind the Match for start of line and whitespace
((?:\d{1,3}) # Matches on groups of 1-3 numbers.
(?:,\d{3})*) # Matches on groups of 3 numbers preceded by a comma
(?=\s|$)''', re.VERBOSE) # Looks ahead of match for end of line and whitespace
Num = numComma.findall(testStr)
print(Num)
Which returns:
['12,454', '1,234', '23,322', '1,234,567', '12']
Thanks again! I have had such a positive first posting experience here, amazing. =)
The issue is due to the fact you're using a repeated capturing group, (,\d{3})* in your pattern. Python's regex engine will match that against both the thousands and ones groups of your number, but only the last repetition will be captured.
I suspect you want to use non-capturing groups instead. Add ?: to the start of each set of parentheses (I'd also recommend, on general principle, to use a raw string, though you don't have escaping issues in your current pattern):
numComma = re.compile(r'^(?:\d{1,3})(?:,\d{3})*$')
Since there are no groups being captured, re.findall will return the whole matched text, which I think is what you wanted. You can also use re.match or re.search and call the group() method on the returned match object to get the whole matched text.
The problem is:
A regex match will return a tuple item for each group. However, it is important to distinguish a group from a capture. Since you only have two parenthesis-delimited groups, the matches will always be tuples of two: the first group, and the second. But the second group matches twice.
1: first group, captured
,234: second group, captured
,343: also second group, which means it overwrites ,234.
Unfortunately, it seems that vanilla Python does not have a way to access any captures of a group other than the last one, unlike .NET's regex implementation. However, if you are only interested in checking a specific number, your best bet would be to call numComma.search(number) on your compiled pattern. If it returns a non-None value, then the input string is a valid number. Otherwise, it is not.
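A minimal Python sketch of the "last repetition wins" behaviour, together with one possible workaround (splitting the whole match; the names captured and parts are only illustrative):

```python
import re

number = '1,234,343'
m = re.match(r'^(\d{1,3})(,\d{3})*$', number)

# The repeated group (,\d{3})* matched twice, but only the last
# repetition is kept in the capture.
captured = m.groups()

# The whole match is still available, so the pieces can be recovered from it.
parts = m.group(0).split(',')

print(captured)
print(parts)
```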
Additionally: A test on your regex. Note that, as Paul Hankin stated, test cases 6 and 7 match even though they shouldn't, due to the first * following the first capturing group, which will make the initial group match any number of times. Otherwise, your regex is correct. Fixed version.
RESPONSE TO EDIT:
The reason your regex now returns an empty list on '1,234,343 and 12,345' is the ^ and $ anchors in your regex. The ^ anchor, at the start of the regex, says 'this point needs to be at the start of the string'. The $ is its counterpart, saying 'this needs to be at the end of the string'. This is good if you want your entire string from start to end to match the pattern, but if you want to pick out multiple numbers, you should do away with them.
HOWEVER!
If you leave the regex in its current form sans anchors, it will now match the individual elements of 1,23,45 as separate numbers. So for this we need to add a zero-width positive lookahead assertion and say, 'make sure that after this number is either whitespace or the end of a line'. You can see the change here. The tail end, (?=\s|$), is our lookahead assertion: it doesn't capture anything, but just makes sure the criteria are met, in this case whitespace (\s) or (|) the end of a line ($).
BUT: In a similar vein, the previous regex would have matched from the 2 onward in "1234,567", giving us the number "234,567", which would be bad. So we use a lookbehind assertion similar to our lookahead at the end: (?<=^|\s), which only matches if we are at the beginning of the string or there is whitespace before the number. (Python's re module requires fixed-width lookbehinds, so there it has to be written as (?:(?<=^)|(?<=\s)).) This version can be found here, and should soundly satisfy any non-decimal number related needs.
Try:
import re
p = re.compile(r'(?:(?<=^)|(?<=\s))((?:\d{1,3})(?:,\d{3})*)(?=\s|$)', re.DOTALL)
test_str = """1,234 and 23,322 and 1,234,567 1,234,567,891 200 and 12 but
not 1,23,1 or ,,1111, or anything else silly"""
for m in re.findall(p, test_str):
    print(m)
and its output will be
1,234
23,322
1,234,567
1,234,567,891
200
12
You can see demo here
This regex would match any valid number, and would never match an invalid number:
(?<=^|\s)(?:(?:0|[1-9][0-9]{0,2}(?:,[0-9]{3})*))(?=\s|$)
https://regex101.com/r/dA4yB1/1
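Python's re module rejects the variable-width lookbehind (?<=^|\s), so a Python version of the same idea needs a fixed-width equivalent. A sketch, assuming (?<!\S) and (?!\S) ("no non-whitespace character before/after") as stand-ins for the two lookarounds; the sample sentence and variable names are made up for illustration:

```python
import re

# (?<!\S) : start of string or preceded by whitespace
# (?!\S)  : end of string or followed by whitespace
number = re.compile(r'(?<!\S)(?:0|[1-9][0-9]{0,2}(?:,[0-9]{3})*)(?!\S)')

sample = ("1,234 and 23,322 and 1,234,567 and 12 "
          "but not 1,23,1 or ,,1111, or 0 or 012")

found = number.findall(sample)
print(found)
```

Besides the malformed groupings, this variant also rejects numbers with a leading zero such as 012, while still accepting a lone 0.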
Motivation
I'm parsing addresses and need to get the address and the country in separated matches, but the countries might have aliases, e.g.:
UK == United Kingdom,
US == USA == United States,
Korea == South Korea,
and so on...
Explanation
So, what I do is create a big regex with all possible country names (at least the ones more likely to appear) separated by the OR operator, like this:
germany|us|france|chile
But the problem is with multi-word country names and their shorter versions, like:
Republic of Moldova and Moldova
Using this as example, we have the string:
'Somewhere in Moldova, bla bla, 12313, Republic of Moldova'
What I want to get from this:
'Somewhere in Moldova, bla bla, more bla, 12313'
'Republic of Moldova'
But this is what I get:
'Somewhere in Moldova, bla bla, 12313, Republic of'
'Moldova'
Regex
As there are several cases, here is what I'm using so far:
^(.*),? \(?(republic of moldova|moldova)\)?(.*[\d\-]+.*|,.*[:/].*)?$
As we might have fax, phone, zip codes or something else after the country name - which I don't care about - I use the last matching group to remove them:
(.*[\d\-]+.*|,.*[:/].*)?
Also, sometimes the country name comes enclosed in parenthesis, so I have \(? and \)? around the second match group, and all the countries go inside it:
(republic of moldova|moldova|...)
Question
The thing is, when there is an entry which is a subset of a bigger one, the shorter one is chosen over the longer one, and the remainder stays in the base_address string.
Is there a way to tell the regex to choose the biggest possible match when two values match?
Edit
I'm using Python with built in re module
As suggested by m.buettner, changing the first matching group from (.*) to (.*?) indeed fixes the current issue, but it also creates another. Consider other example:
'Department of Chemistry, National University of Singapore, 4512436 Singapore'
Matches:
'Department of Chemistry, National University of'
'Singapore'
Here it matches too soon now.
Your problem is greediness.
The .* right at the beginning tries to match as much as possible. That is everything until the end of the string. But then the rest of your pattern fails. So the engine backtracks, and discards the last character matched with .* and tries the rest of the pattern again (which still fails). The engine will repeat this process (fail match, backtrack/discard one character, try again) until it can finally match with the rest of the pattern. The first time this occurs is when .* matches everything up to Moldova (so .* is still consuming Republic of). And then the alternation (which still cannot match republic of moldova) will gladly match moldova and return that as the result.
The simplest solution is to make the repetition ungreedy:
^(.*?)...
Note that the question mark right after a quantifier does not mean "optional", but makes it "ungreedy". This simply reverses the behaviour: the engine first tries to leave out the .* completely, and in the process of backtracking it includes one more character after every failed attempt to match the rest of the pattern.
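The difference can be seen with a simplified version of the pattern from the question (trailing-junk group left out, two country names only). This is just a sketch; the variable names are illustrative.

```python
import re

address = 'Somewhere in Moldova, bla bla, 12313, Republic of Moldova'
countries = r'(republic of moldova|moldova)'

# Greedy: .* starts with everything and gives characters back one by one,
# so the first success leaves only "Moldova" for the country group.
greedy = re.search(r'^(.*),? ' + countries + r'$', address, re.IGNORECASE)

# Ungreedy: .*? starts empty and grows one character at a time,
# so the longer country name is reached first.
lazy = re.search(r'^(.*?),? ' + countries + r'$', address, re.IGNORECASE)

print(greedy.groups())
print(lazy.groups())
```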
EDIT:
There are usually better alternatives to ungreediness. As you stated in a comment, the ungreedy solution brings another problem that countries in earlier parts of the string might be matched. What you can do instead, is to use lookarounds that ensure that there are no word characters (letters, digits, underscore) before or after the country. That means, a country word is only matched, if it is surrounded by commas or either end of the string:
^(.*),?(?<!\w)[ ][(]?(c|o|u|n|t|r|i|e|s)[)]?(?![ ]*\w)(.*[\d\-]+.*|,.*[:/].*)?$
Since lookarounds are not actually part of the match, they do not interfere with the rest of your pattern - they simply check a condition at a specific position in the match. The two lookarounds I have added ensure that:
There is no word character before the mandatory space preceding the country.
There is no word character after the country that is separated by nothing but spaces.
Note that I've wrapped spaces in a character class, as well as the literal parentheses (instead of escaping them). Neither is necessary, but I prefer these readability-wise, so they are just a suggestion.
EDIT 2:
As abarnert mentioned in a comment, how about not using a regex-only solution?
You could split the string on ",", then trim every result, and check these against your list of countries (possibly using regex). If any component of your address is the same as one of your countries, you can return that. If there are multiple matches, then at least you can detect the ambiguity and deal with it properly.
Sort all the alternatives: build the regex programmatically from an array of names sorted from longest to shortest. Then wrap the whole alternation in an atomic group (the PCRE engine has them; I don't know whether the RE engine does too). Because of the atomic group, the regex engine never backtracks to try another alternative inside it, and since the alternatives are sorted, the match will always be the longest one.
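A rough Python sketch of the split-and-compare idea follows. The country set, the function name split_address and the clean-up rule (dropping parentheses and digits) are assumptions made for illustration, not part of the original question.

```python
import re

# Hypothetical country list; in practice this would hold every name and alias.
COUNTRY_NAMES = {"moldova", "republic of moldova", "singapore"}

def split_address(address):
    """Return (base_address, country), or (address, None) if no country is found."""
    parts = [part.strip() for part in address.split(",")]
    # Walk backwards, since the country usually comes last.
    for i in range(len(parts) - 1, -1, -1):
        # Drop parentheses and digits (e.g. a postal code) before comparing.
        cleaned = re.sub(r"[()\d]", "", parts[i]).strip()
        if cleaned.lower() in COUNTRY_NAMES:
            return ", ".join(parts[:i]), cleaned
    return address, None

print(split_address('Somewhere in Moldova, bla bla, 12313, Republic of Moldova'))
print(split_address('Department of Chemistry, National University of Singapore, 4512436 Singapore'))
```

Because a whole comma-separated component has to equal a country name, "Somewhere in Moldova" and "National University of Singapore" are not mistaken for countries. Anything after the country component is discarded in this sketch.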
Tada.