Python Regex Findall non-greedy - python

I am relatively new to regex and I seem to be struggling to understand the greedy vs non-greedy search (if that is indeed the issue here). Let's say I have a simple text such as this:
# numbers: 4 A 3 B
My goal would be to run a findall to get something like the following output:
['# number:', '4 A 3 B', ' 4 A', ' 3 B']
So if I use the following regex with findall, I would expect it to work:
matches = re.findall(r"(# numbers:)(((?:\s\d)(?:\s\D))*)", "# numbers: 4 A 3 B")
However, the actual output is this:
[('# numbers:', ' 4 A 3 B', ' 3 B')]
Can someone explain why the group ((\s\d)(\d\D)) is only matching ' 3 B' and not also ' 4 A'? I assume it has something to do with the greedy vs. non-greedy search of * is that true? And if so, could you explain how to solve this issue?
Thanks in advance!

I would use re.findall here, twice. First, extract the digit/non digit text series, then use re.findall a second time to find the tuples:
inp = "# numbers: 4 A 3 B"
text = re.findall(r'^# numbers:\s+(.*)$', inp)[0]
matches = re.findall(r'(\d+)\s+(\D+)', text)
print(matches) # [('4', 'A '), ('3', 'B')]

Related

Remove numbers from list but not those in a string

I have a list of list as follows
list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
'2 days ago I saw it', 'every day is a blessing',
' 345000 people have eaten here at the beach']
I want to remove 3, but not 5th or 5x35omega44. All the solutions I have searched for and tried end up removing numbers in an alphanumeric string, but I want those to remain as is. I want my list to look as follows:
list_1 = ['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing',
' people have eaten here at the beach']
I am trying the following:
[' '.join(s for s in words.split() if not any(c.isdigit() for c in s)) for words in list_1]
Use lookarounds to check if digits are not enclosed with letters or digits or underscores:
import re
list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
'2 days ago I saw it', 'every day is a blessing',
' 345000 people have eaten here at the beach']
for l in list_1:
print(re.sub(r'(?<!\w)\d+(?!\w)', '', l))
Output:
what are you guys doing there on 5th avenue
my password is 5x35omega44
days ago I saw it
every day is a blessing
people have eaten here at the beach
Regex demo
One approach would be to use try and except:
def is_intable(x):
try:
int(x)
return True
except ValueError:
return False
[' '.join([word for word in sentence.split() if not is_intable(word)]) for sentence in list_1]
It sounds like you should be using regex. This will match numbers separated by word boundaries:
\b(\d+)\b
Here is a working example.
Some Python code may look like this:
import re
for item in list_1:
new_item = re.sub(r'\b(\d+)\b', ' ', item)
print(new_item)
I am not sure what the best way to handle spaces would be for your project. You may want to put \s at the end of the expression, making it \b(\d+)\b\s or you may wish to handle this some other way.
You can use isinstance(word, int) function and get a shorter way to do it, you could try something like this:
[' '.join([word for word in expression.split() if not isinstance(word, int)]) for expression in list_1]
>>>['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing', 'people have eaten here at the beach']
Combining the very helpful regex solutions provided, in a list comprehension format that I wanted, I was able to arrive at the following:
[' '.join([re.sub(r'\b(\d+)\b', '', item) for item in expression.split()]) for expression in list_1]

Parse sentences with [value](type) format

I want to parse and extract key, values from a given sentence which follow the following format:
I want to get [samsung](brand) within [1 week](duration) to be happy.
I want to convert it into a split list like below:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
I have tried to split it using [ or ) :
re.split('\[|\]|\(|\)',s)
which is giving output:
['I want to get ',
'samsung',
'',
'brand',
' within ',
'1 week',
'',
'duration',
' to be happy.']
and
re.split('\[||\]|\(|\)',s)
is giving below output :
['I want to get ',
'samsung](brand) within ',
'1 week](duration) to be happy.']
Any help is appreciated.
Note: This is similar to stackoverflow inline links as well where if we type : go to [this link](http://google.com) it parse it as link.
As first step we split the string, and in second step we modify the string:
s = 'I want to get [samsung](brand) within [1 week](duration) to be happy.'
import re
s = re.split('(\[[^]]*\]\([^)]*\))', s)
s = [re.sub('\[([^]]*)\]\(([^)]*)\)', r'\1:\2', i) for i in s]
print(s)
Prints:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
You may use a two step approach: process the [...](...) first to format as needed and protect these using some rare/unused chars, and then split with that pattern.
Example:
s = "I want to get [samsung](brand) within [1 week](duration) to be happy.";
print(re.split(r'⦅([^⦅⦆]+)⦆', re.sub(r'\[([^][]*)]\(([^()]*)\)', r'⦅\1:\2⦆', s)))
See the Python demo
The \[([^\][]*)]\(([^()]*)\) pattern matches
\[ - a [ char
([^\][]*) - Group 1 ($1): any 0+ chars other than [ and ]
]\( - ]( substring
([^()]*) - Group 2 ($2): any 0+ chars other than ( and )
\) - a ) char.
The ⦅([^⦅⦆]+)⦆ pattern just matches any ⦅...⦆ substring but keeps what is in between as it is captured.
You could replace the ]( pattern first, then split on [) characters
re.replace('\)\[', ':').split('\[|\)',s)
One approach, using re.split with a lambda function:
sentence = "I want to get [samsung](brand) within [1 week](duration) to be happy."
parts = re.split(r'(?<=[\])])\s+|\s+(?=[\[(])', sentence)
processTerms = lambda x: re.sub('\[([^\]]+)\]\(([^)]+)\)', '\\1:\\2', x)
parts = list(map(processTerms, parts))
print(parts)
['I want to get', 'samsung:brand', 'within', '1 week:duration', 'to be happy.']

python regex add space whenever a number is adjacent to a non-number

I am trying to separate non-numbers from numbers in a Python string. Numbers can include floats.
Examples
Original String Desired String
'4x5x6' '4 x 5 x 6'
'7.2volt' '7.2 volt'
'60BTU' '60 BTU'
'20v' '20 v'
'4*5' '4 * 5'
'24in' '24 in'
Here is a very good thread on how to achieve just that in PHP:
Regex: Add space if letter is adjacent to a number
I would like to manipulate the strings above in Python.
Following piece of code works in the first example, but not in the others:
new_element = []
result = [re.split(r'(\d+)', s) for s in (unit)]
for elements in result:
for element in elements:
if element != '':
new_element.append(element)
new_element = ' '.join(new_element)
break
Easy! Just replace it and use Regex variable. Don't forget to strip whitespaces.
Please try this code:
import re
the_str = "4x5x6"
print re.sub(r"([0-9]+(\.[0-9]+)?)",r" \1 ", the_str).strip() // \1 refers to first variable in ()
I used split, like you did, but modified it like this:
>>> tcs = ['123', 'abc', '4x5x6', '7.2volt', '60BTU', '20v', '4*5', '24in', 'google.com-1.2', '1.2.3']
>>> pattern = r'(-?[0-9]+\.?[0-9]*)'
>>> for test in tcs: print(repr(test), repr(' '.join(segment for segment in re.split(pattern, test) if segment)))
'123' '123'
'abc' 'abc'
'4x5x6' '4 x 5 x 6'
'7.2volt' '7.2 volt'
'60BTU' '60 BTU'
'20v' '20 v'
'4*5' '4 * 5'
'24in' '24 in'
'google.com-1.2' 'google.com -1.2'
'1.2.3' '1.2 . 3'
Seems to have the desired behavior.
Note that you have to remove empty strings from the beginning/end of the array before joining the string. See this question for an explanation.

Finding words in phrases using regular expression

I wanna use regular expression to find phrases that contains
1 - One of the N words (any)
2 - All the N words (all )
>>> import re
>>> reg = re.compile(r'.country.|.place')
>>> phrases = ["This is an place", "France is a European country, and a wonderful place to visit", "Paris is a place, it s the capital of the country.side"]
>>> for phrase in phrases:
... found = re.findall(reg,phrase)
... print found
...
Result:
[' place']
[' country,', ' place']
[' place', ' country.']
It seems that I am messing around, I need to specify that I need to find a word, not just a part of word in both cases.
Can anyone help by pointing to the issue ?
Because you are trying to match entire words, use \b to match word boundaries:
reg = re.compile(r'\bcountry\b|\bplace\b')

remove multiple substrings inside a string

Here's hoping somebody can shed some light on this question because it has me stumped. I have a string that looks like this:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
I want this result:
abcdef ghijk lmnop qrs tuv wxyz 0123456789
Having reviewed numerous questions and answers here, the closest I have come to a solution is:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
s = re.sub('\[\[.*?\|', '', s)
s = re.sub('[\]\]]', '', s)
--> abcdef ghijk lmnop wxyz 0123456789
Since not every substring within double brackets contains a pipe, the re.sub removes everything from '[[' to next '|' instead of checking within each set of double brackets.
Any assistance would be most appreciated.
What about this:
In [187]: re.sub(r'([\[|\]])|((?<=\[)\w+\s+\w+(?=|))', '', s)
Out[187]: 'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
I purpose you a contrary method, instead of remove it you can just catch patterns you want. I think this way can make your code more semantics.
There are two patterns you wish to catch:
Case: words outside [[...]]
Pattern: Any words are either leaded by ']] ' or trailed by ' [['.
Regex: (?<=\]\]\s)\w+|\w+(?=\s\[\[)
Case: words inside [[...]]
Pattern: Any words are trailed by ']]'
Regex: \w+(?=\]\])
Example code
1 #!/usr/bin/env python
2 import re
3
4 s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789 "
5
6 p = re.compile('(?<=\]\]\s)\w+|\w+(?=\s\[\[)|\w+(?=\]\])')
7 print p.findall(s)
Result:
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
>>> import re
>>> s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
>>> re.sub(r'(\[\[[^]]+?\|)|([\[\]])', '', s)
'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
This searches for and removes the following two items:
Two opening brackets followed by a bunch of stuff that isn't a closing bracket followed by a pipe.
Opening or closing brackets.
As a general regex using built-in re module you can use follwing regex that used look-around:
(?<!\[\[)\b([\w]+)\b(?!\|)|\[\[([^|]*)\]\]
you can use re.finditer to get the desire result :
>>> g=re.finditer(r'(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])',s)
>>> [j.group() for j in g]
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
The preceding regex contains from 2 part one is :
(?<=\[\[)[^|]*(?=\]\])
which match any combinations of word characters that not followed by | and not precede by [[.
the second part is :
\[\[([^|]*)\]\]
that will match any thing between 2 brackets except |.

Categories

Resources