Python matching regex multiple times in a row (not the findall way) - python

This question is not about finding 'a' multiple times in a string, etc.
What I would like to do is match the regexp
[ a-zA-Z0-9]{1,3}\.
multiple times in a row. One way of doing this is using |:
'[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.|[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.|[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.'
So this matches the regexp 4, 3, or 2 times in a row.
Matches stuff like:
a. v. b.
m a.b.
Is there a way to write this more compactly?
I tried:
([ a-zA-Z0-9]{1,3}\.){2,4}
but it does not behave the way I expected. This one returns:
regex.findall(string)
[u' b.', u'b.']
string is:
a. v. b. split them a.b. split somethinf words. THen we say some more words, like ten
Is there any way to do this? The goal is to match possible English abbreviations and names like Mary J. E., things that the sentence tokenizer treats as sentence-ending punctuation but are not.
I want to match all of this:
U.S. , c.v.a.b. , a. v. p.

First of all, your regex does work as you expect:
>>> s="aa2.jhf.jev.d23.llo."
>>> import re
>>> re.search(r'([ a-zA-Z0-9]{1,3}\.){2,4}',s).group(0)
'aa2.jhf.jev.d23.'
But if you want findall to return whole substrings like U.S. , c.v.a.b. , a. v. p., you need to put the whole regex in a capture group:
>>> s = 'a. v. b. split them a.b. split somethinf words. THen we say some more'
>>> re.findall(r'(([ a-zA-Z0-9]{1,3}\.){2,4})',s)
[('a. v. b.', ' b.'), ('m a.b.', 'b.')]
Then use a list comprehension to keep the first element of each tuple, which is the whole match:
>>> [i[0] for i in re.findall(r'(([ a-zA-Z0-9]{1,3}\.){2,4})',s)]
['a. v. b.', 'm a.b.']
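A variation on the same idea, as a minimal sketch: making the repeated group non-capturing with (?:...) lets findall return the whole match directly, so the list comprehension is not needed. The helper name find_abbreviations and the strip() call are illustrative assumptions, not part of the answer above.

```python
import re

# Non-capturing group: findall returns the full match, not the last repetition.
ABBREV = re.compile(r'(?:[ a-zA-Z0-9]{1,3}\.){2,4}')


def find_abbreviations(text):
    """Return runs of 2 to 4 short dotted tokens, with outer spaces trimmed."""
    return [m.strip() for m in ABBREV.findall(text)]


s = 'a. v. b. split them a.b. split somethinf words. THen we say some more'
print(find_abbreviations(s))                             # ['a. v. b.', 'm a.b.']
print(find_abbreviations('U.S. , c.v.a.b. , a. v. p.'))  # ['U.S.', 'c.v.a.b.', 'a. v. p.']
```

Because the character class includes a space, a match can still pick up a neighbouring letter (the 'm' of 'them' in 'm a.b.'), exactly as in the original pattern.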

Related

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to match only strings of 3 capitalized words or fewer that are immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length or content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
((?:[A-Z][a-z]+ ?){1,3}) - capitalized word repeated 1 to 3 times, separated by spaces
[A-Z] - capital char (word beginning char)
[a-z]+ - non-capital chars (rest of the word)
_? - space separator between capitalized words (_ is a space); the ? allows a single or last word with no trailing space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
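For reference, a minimal sketch of how the pattern above could be used from Python with re.finditer, reading group 2 of each match. The shortened sample string and the names QUOTED_TITLE and sample are assumptions made for brevity; the pattern itself is the one from the answer.

```python
import re

# Pattern from the answer above; group 2 holds the text between the quotes.
QUOTED_TITLE = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

# Shortened sample text (an assumption, not the full sentence from the question).
sample = ('Saul "Canelo" Alvarez. \'Money Mayweather\'. "Here Goes a String". '
          '"this is not A string", "Three Word String".')

titles = [m.group(2) for m in QUOTED_TITLE.finditer(sample)]
print(titles)  # ['Canelo', 'Money Mayweather', 'Three Word String']
```

Note that the optional space in the pattern also lets a trailing space before the closing quote through, so a stricter variant may be needed if that matters.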
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
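Cleaner:
import re

matches = []
# Collect every run of word characters and spaces that sits between quotes
for m in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    words = m.split(' ')
    # Keep it only if every word is title-cased and there are at most 3 words
    if all(x.istitle() for x in words) and len(words) <= 3:
        matches.append(m)
print(matches)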
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

Finding words (one or more consecutive) whose first letter is uppercase

I need to write a regular expression in Python that finds words in a text whose first letter is uppercase. These words can appear alone or as a run of consecutive words.
For example, for the sentence
Dallas Buyer Club is a great American biographical drama film,co-written by Craig Borten and Melisa Wallack, and Directed by Jean-Marc Vallee.
the expected output should be
'Dallas Buyer Club', 'American', 'Craig Borten', 'Melisa Wallack', 'Directed', 'Jean-Marc Vallee'
I have written a regular expression for this,
([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)
but its output is
'Dallas Buyer Club', 'Craig Borten', 'Melisa Wallack', 'Marc Vallee'
It only returns runs of consecutive capitalized words, not single words like
'American', 'Directed'
Also, the regular expression
[A-Z][a-z]+
returns all the words, but individually:
'Dallas', 'Buyer', 'Club' and so on.
Please help me with this.
I think you mixed up the brackets (and made the regex a bit too complicated). Simply use:
[A-Z][a-z]+(?:\s[A-Z][a-z]+)*
So here we have a matching part [A-Z][a-z]+, and in order to match more words, we simply use (?:...)* to repeat ... zero or more times. In the ... we include the separator (here \s) and the word pattern again ([A-Z][a-z]+).
This will however not include the hyphen between 'Jean' and 'Marc'. In order to include it as well, we can expand the \s:
[A-Z][a-z]+(?:[\s-][A-Z][a-z]+)*
Depending on which other characters (or sequences of characters) are allowed, you may have to alter the [\s-] part further.
This then generates:
>>> rgx = re.compile(r'[A-Z][a-z]+(?:[\s-][A-Z][a-z]+)*')
>>> txt = r'Dallas Buyer Club is a great American biographical drama film,co-written by Craig Borten and Melisa Wallack, and Directed by Jean-Marc Vallee.'
>>> rgx.findall(txt)
['Dallas Buyer Club', 'American', 'Craig Borten', 'Melisa Wallack', 'Directed', 'Jean-Marc Vallee']
EDIT: in case the remaining characters can be uppercase as well, you can use:
[A-Z][A-Za-z]+(?:[\s-][A-Z][A-Za-z]+)*
Note that this only matches words of two or more characters. If single-letter words should be matched as well, as in 'J R R Tolkien', you can write:
[A-Z][A-Za-z]*(?:[\s-][A-Z][A-Za-z]*)*
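To make the difference between the last two patterns concrete, here is a small sketch that runs both on the same input. The sample sentence and the names two_plus and one_plus are assumptions chosen for illustration.

```python
import re

# Each word needs at least two letters (the EDIT variant above).
two_plus = re.compile(r'[A-Z][A-Za-z]+(?:[\s-][A-Z][A-Za-z]+)*')
# Single-letter words such as initials are accepted as well.
one_plus = re.compile(r'[A-Z][A-Za-z]*(?:[\s-][A-Z][A-Za-z]*)*')

sample = 'J R R Tolkien wrote The Hobbit'  # assumed sample sentence

print(two_plus.findall(sample))  # ['Tolkien', 'The Hobbit']
print(one_plus.findall(sample))  # ['J R R Tolkien', 'The Hobbit']
```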

Python Multiple Strings to Tuples

Hi everyone I wonder if you can help with my problem.
I am defining a function which takes a string and converts it into 5 items in a tuple. The function will be required to take a number of strings, in which some of the items will vary in length. How would I go about doing this as using the indexes of the string does not work for every string.
As an example -
I want to convert a string like the following:
Doctor E212 40000 Peter David Jones
The tuple items of the string will be:
Job(Doctor), Department(E212), Pay(40000), Other names (Peter David), Surname (Jones)
However some of the strings have 2 other names where others will have just 1.
How would I go about converting strings like this into tuples when the other names can vary between 1 and 2?
I am a bit of a novice when it comes to python as you can probably tell ;)
With Python 3, you can just split() and use "catch-all" tuple unpacking with *:
>>> string = "Doctor E212 40000 Peter David Jones"
>>> job, dep, sal, *other, names = string.split()
>>> job, dep, sal, " ".join(other), names
('Doctor', 'E212', '40000', 'Peter David', 'Jones')
Alternatively, you can use regular expressions, e.g. something like this:
>>> import re
>>> m = re.match(r"(\w+) (\w+) (\d+) ([\w\s]+) (\w+)", string)
>>> job, dep, sal, other, names = m.groups()
>>> job, dep, sal, other, names
('Doctor', 'E212', '40000', 'Peter David', 'Jones')
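Since the question asks for a function, the split-and-unpack approach could be wrapped up roughly as follows. This is only a sketch: the name parse_record and the second sample record are made up for illustration, and it assumes fields are separated by whitespace and that every record has at least the job, department, pay and surname.

```python
def parse_record(line):
    """Split a record into (job, department, pay, other names, surname)."""
    # *other soaks up however many middle names there are (possibly none).
    job, department, pay, *other, surname = line.split()
    return job, department, pay, " ".join(other), surname


print(parse_record("Doctor E212 40000 Peter David Jones"))
# ('Doctor', 'E212', '40000', 'Peter David', 'Jones')
print(parse_record("Nurse A100 25000 Ann Smith"))  # hypothetical record
# ('Nurse', 'A100', '25000', 'Ann', 'Smith')
```

A line with fewer than four fields raises ValueError at the unpacking step, which may or may not be the behaviour you want.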

Finding words in phrases using regular expression

I want to use a regular expression to find phrases that contain
1 - One of the N words (any)
2 - All of the N words (all)
>>> import re
>>> reg = re.compile(r'.country.|.place')
>>> phrases = ["This is an place", "France is a European country, and a wonderful place to visit", "Paris is a place, it s the capital of the country.side"]
>>> for phrase in phrases:
...     found = re.findall(reg, phrase)
...     print(found)
...
Result:
[' place']
[' country,', ' place']
[' place', ' country.']
It seems that I am getting this wrong: in both cases I need to find a whole word, not just part of a word.
Can anyone help by pointing out the issue?
Because you are trying to match entire words, use \b to match word boundaries:
reg = re.compile(r'\bcountry\b|\bplace\b')
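The pattern above covers the "any" case. One possible way to handle both cases with word boundaries is sketched below; the helper names contains_any and contains_all are assumptions, and re.escape is used so that words containing regex metacharacters are treated literally.

```python
import re


def contains_any(phrase, words):
    """True if at least one of the words occurs as a whole word."""
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    return re.search(pattern, phrase) is not None


def contains_all(phrase, words):
    """True if every word occurs as a whole word (in any order)."""
    return all(re.search(r'\b' + re.escape(w) + r'\b', phrase) for w in words)


words = ['country', 'place']
phrases = [
    "This is an place",
    "France is a European country, and a wonderful place to visit",
    "The countryside is lovely",  # assumed extra example: no whole-word match
]
for phrase in phrases:
    print(contains_any(phrase, words), contains_all(phrase, words))
# True False
# True True
# False False
```

Checking each word separately keeps the "all" case independent of word order, which a single alternation cannot express directly.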

python, re.search / re.split for phrases which look like a title, i.e. starting with an upper case letter

I have a list of phrases (input by the user) that I'd like to locate in a text file, for example:
titles = ['Blue Team', 'Final Match', 'Best Player',]
text = 'In today Final match, The Best player is Joe from the Blue Team and the second best player is Jack from the Red team.'
1./ I can find all the occurrences of these phrases like so
titre = re.compile(r'(?P<title>%s)' % '|'.join(titles), re.M)
list = [ t for t in titre.split(text) if titre.search(t) ]
(For simplicity, I am assuming a perfect spacing.)
2./ I can also find variants of these phrases, e.g. 'Blue team', 'final Match', 'best player' ... using re.I, if they ever appear in the text.
But I want to restrict the search to variants of the input phrases whose first letter is upper-cased, e.g. 'Blue team' in the text, regardless of how they were entered as input, e.g. 'bluE tEAm'.
Is it possible to write something to "block" the re.I flag for a portion of a phrase? In pseudo code I imagine generating something like '[B]lue Team|[F]inal Match'.
Note: My primary goal is not, for example, calculating frequency of the input phrases in the text but extracting and analyzing the text fragments between or around them.
I would use re.I and modify the list-comp to:
l = [ t for t in titre.split(text) if titre.search(t) and t[0].isupper() ]
As far as I know, regular expressions before Python 3.6 don't let you limit the ignore-case flag to a region of the pattern (3.6 added scoped inline flags such as (?i:...)). However, you can generate a new version of the text in which every character is lowercased except the first one of each word:
new_text = ' '.join([word[0] + word[1:].lower() for word in text.split()])
This way, a regular expression without the ignore flag will match taking into account the casing only for the first character of each word.
How about modifying the input so that it is in the correct case before you use it in the regular expression?
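On Python 3.6 or later, the pseudo code in the question can be approximated with a scoped inline flag: (?i:...) applies case-insensitivity only inside the group. The sketch below builds such a pattern from the user's titles; the function name title_pattern is an assumption, and it assumes every title is a non-empty string.

```python
import re


def title_pattern(titles):
    """Build a pattern: first letter must be upper case, the rest ignores case."""
    parts = []
    for title in titles:
        first, rest = title[0].upper(), title[1:]
        # (?i:...) limits the ignore-case flag to the remainder of the phrase.
        parts.append(re.escape(first) + '(?i:' + re.escape(rest) + ')')
    return re.compile('|'.join(parts))


titles = ['bluE tEAm', 'Final Match', 'Best Player']
text = ('In today Final match, The Best player is Joe from the Blue Team '
        'and the second best player is Jack from the Red team.')

titre = title_pattern(titles)
print(titre.findall(text))  # ['Final match', 'Best player', 'Blue Team']
```

The lower-cased 'best player' is skipped because the first letter sits outside the case-insensitive group. The same compiled pattern can be passed to split() to get the fragments between the phrases.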
