How to extract substring between two keywords with exceptional cases? [duplicate] - python

This question already has answers here:
RegExp exclusion, looking for a word not followed by another
(3 answers)
Closed 3 years ago.
I want to extract the substring between apple and each in a string. However, if each is followed by box, I want the result to be an empty string.
In detail:
1) apple costs 5 dollars each -> costs 5 dollars
2) apple costs 5 dollars each box -> ``
I tried re.findall('(?<=apple)(.*?)(?=each)', s).
It handles 1) but not 2).
How can I solve this?
Thanks.

You could add a negative lookahead, asserting what is on the right is not box. For a match only you can omit the capturing group.
(?<=apple).*?(?=each(?! box))
If you don't want to match the leading and trailing spaces, you could add them to the lookarounds:
import re
s = "apple costs 5 dollars each"
print(re.findall(r'(?<=apple ).*?(?= each(?! box))', s))
Output
['costs 5 dollars']
You can also use a capturing group without the positive lookaheads and use the negative lookahead only. The value is in the first capturing group.
You could make use of word boundaries \b to prevent the word being part of a larger word.
\bapple\b(.*?)\beach\b(?! box)
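A minimal sketch of the capturing-group pattern above, applied to both inputs from the question. The helper name between_apple_and_each is hypothetical, and stripping the surrounding spaces and returning an empty string when nothing matches are assumptions about the desired behaviour:

```python
import re

# Pattern from the answer above: capture lazily between the words "apple"
# and "each", but reject an "each" that is followed by " box".
PATTERN = re.compile(r'\bapple\b(.*?)\beach\b(?! box)')

def between_apple_and_each(text):
    """Hypothetical helper: return the captured text, or '' if nothing matches."""
    m = PATTERN.search(text)
    return m.group(1).strip() if m else ''

print(repr(between_apple_and_each("apple costs 5 dollars each")))      # 'costs 5 dollars'
print(repr(between_apple_and_each("apple costs 5 dollars each box")))  # ''
```

Because the quantifier is lazy rather than anchored, a later "each" that is not followed by " box" would still produce a match, so this may need tightening for inputs that contain the word more than once.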

Try this without using regex:
myString = "apple costs 5 dollars each box"
myList = myString.split(" ")
storeString = []
for x in myList:
    if x == "apple":
        continue
    elif x == "each":
        break
    else:
        storeString.append(x)
# join the collected words back into a single string
listToStr = ' '.join(map(str, storeString))
print(listToStr)
Output:
costs 5 dollars

Related

Extracting the digit between parentheses in Python [duplicate]

This question already has an answer here:
Regular expression works on regex101.com, but not on prod
(1 answer)
Closed 4 months ago.
import re
print(re.search('\(\d{1,}\)', "b' 1. Population, 2016 (1)'"))
I am trying to extract the digits (one or more) between parentheses in strings. The code above shows my attempted solution. I checked my regular expression on https://regex101.com/ and expected the code to find a match. However, the returned value is None. Can someone let me know what happened?
Write the pattern as a raw string so that the backslashes reach the regex engine unchanged, and add a capturing group to get only the digits:
inp = "b' 1. Population, 2016 (1)'"
nums = re.findall(r'\((\d{1,})\)', inp)
print(nums) # ['1']
Otherwise, you would have to double every backslash in the pattern (for example \\d).
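As a small sketch of that point, the raw-string pattern and the doubled-backslash pattern spell exactly the same text, so both behave identically. The variable names raw and doubled are only for illustration:

```python
import re

raw = r'\((\d{1,})\)'        # raw string: backslashes are kept as written
doubled = '\\((\\d{1,})\\)'  # normal string: every backslash is doubled

# Both literals produce the same pattern text.
print(raw == doubled)            # True

inp = "b' 1. Population, 2016 (1)'"
print(re.findall(raw, inp))      # ['1']
print(re.findall(doubled, inp))  # ['1']
```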
The regex below grabs the digits inside the brackets only when digits are the sole content of the brackets.
r"\((?P<digits_inside_brackets>\d+)\)"
For your scenario, it matches 1 under the group "digits_inside_brackets".
It can be run with the snippet below:
import re
user_string = "b' 1. Population, 2016 (1)"
# captures only when digits are the sole content of the brackets
comp = re.compile(r"\((?P<digits_inside_brackets>\d+)\)")
for i in re.finditer(comp, user_string):
    print(i.group("digits_inside_brackets"))
Output for the above snippet:
1
Grab digits even when whitespace surrounds them:
r"\(\s*(?P<digits_inside_brackets>\d+)\s*\)"
Grab digits inside brackets even when other non-digit characters surround them:
r"\(\D*(?P<digits_inside_brackets>\d+)\D*\)"
Output when applied with above RE
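A minimal sketch comparing the three patterns above side by side. The sample inputs and the helper name grab are made up for illustration and do not come from the original answer:

```python
import re

strict = re.compile(r"\((?P<digits_inside_brackets>\d+)\)")
spaced = re.compile(r"\(\s*(?P<digits_inside_brackets>\d+)\s*\)")
loose = re.compile(r"\(\D*(?P<digits_inside_brackets>\d+)\D*\)")

def grab(pattern, text):
    """Hypothetical helper: list every value captured by the named group."""
    return [m.group("digits_inside_brackets") for m in pattern.finditer(text)]

# Hypothetical inputs: digits only, digits padded with spaces, digits after text.
samples = ["2016 (1)", "2016 ( 1 )", "2016 (no. 1)"]

for name, pattern in [("strict", strict), ("spaced", spaced), ("loose", loose)]:
    print(name, [grab(pattern, s) for s in samples])
# strict [['1'], [], []]
# spaced [['1'], ['1'], []]
# loose [['1'], ['1'], ['1']]
```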

only keep digits after ":" in regular expression in python [duplicate]

This question already has answers here:
How to grab number after word in python
(4 answers)
Closed 2 years ago.
I want to extract the numbers for each parameter below:
import re
parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
The desired output should be: ['42602','42401','42101']
I first tried re.findall(r'\d+',parameters), but it also returns the "2" from "NO2" and "SO2".
Then I tried re.findall(':.*',parameters), but it returns [': 42602', ': 42401', ': 42101']
If I can not rename the "NO2" to "Nitrogen dioxide", is there a way just to collect numbers on the right (after ":")?
Many thanks.
If you do not want to use capturing groups, you could use a lookbehind.
(?<=:\s)\d+
Details:
(?<=:\s): asserts that the match is immediately preceded by a colon and a whitespace character
\d+: matches one or more digits
I also tested this in Python.
import re
parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
result = re.findall(r'(?<=:\s)\d+',parameters)
print(result)
Result
['42602', '42401', '42101']
You can use the following regex to capture the numbers:
^\s*\w+:\s(\d+)$
Here, ^ at the beginning asserts the position at the start of the line. \s* allows zero or more whitespace characters before the content. \w+:\s matches one or more word characters followed by ":" and a whitespace character, for example "NO2: ".
Finally, (\d+) matches the following digits you want as a group. $ matches the end of the line.
To get all the matches as a list you can use
matches = re.findall(r'^\s*\w+:\s(\d+)$', parameters, re.MULTILINE)
As stated in the docs, when re.MULTILINE is specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line.
The result is as follows:
>>> print(matches)
['42602', '42401', '42101']
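To illustrate why the flag matters, here is a short sketch that runs the same pattern with and without re.MULTILINE on the string from the question. The variable names without_flag and with_flag are only for illustration:

```python
import re

parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
pattern = r'^\s*\w+:\s(\d+)$'

# Without the flag, ^ and $ only anchor at the ends of the whole string,
# so no single line can satisfy both anchors.
without_flag = re.findall(pattern, parameters)

# With the flag, ^ and $ also anchor at every line break.
with_flag = re.findall(pattern, parameters, re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['42602', '42401', '42101']
```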
To put my two cents in, you could simply use
re.findall(r'(\b\d+\b)', parameters)
If you happen to have other digits floating around somewhere in your string, be more precise with
\w+:\s*(\d+)
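re.findall(r'(?<=:\s)\d+', parameters)
This should work. The (?<=:\s) part is a look-behind assertion: it requires a colon and a whitespace character immediately before the digits without including them in the match.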
You just need to specify where in your string you want to search for digits. You can use:
re.findall(r': (\d+)', parameters)
This tells Python to look for digits in the part of the string after ":" and the "space".

Regex: How to find substring that does NOT contain a certain word [duplicate]

This question already has answers here:
Regular expressions: Ensuring b doesn't come between a and c
(4 answers)
Closed 3 years ago.
I have this string:
string = "STARTcandyFINISH STARTsugarFINISH STARTpoisonFINISH STARTBlobpoisonFINISH STARTpoisonBlobFINISH"
I would like to match and capture all substrings that appear in between START and FINISH but only if the word "poison" does NOT appear in that substring. How do I exclude this word and capture only the desired substrings?
re.findall(r'START(.*?)FINISH', string)
Desired captured groups:
candy
sugar
Using a tempered dot, we can try:
string = "STARTcandyFINISH STARTsugarFINISH STARTpoisonFINISH STARTBlobpoisonFINISH STARTpoisonBlobFINISH"
matches = re.findall(r'START((?:(?!poison).)*?)FINISH', string)
print(matches)
This prints:
['candy', 'sugar']
For an explanation of how the regex pattern works, we can have a closer look at:
(?:(?!poison).)*?
This uses a tempered dot trick. It will match, one character at a time, so long as what follows is not poison.
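For comparison, a minimal sketch that sets the tempered dot next to a plainer alternative: capture everything between START and FINISH, then filter in Python. The filtering approach is not from the answer above; it is only an assumption about what may be easier to read when the exclusion rule is a simple substring test:

```python
import re

string = ("STARTcandyFINISH STARTsugarFINISH STARTpoisonFINISH "
          "STARTBlobpoisonFINISH STARTpoisonBlobFINISH")

# Tempered dot: before consuming each character, assert that "poison"
# does not start at that position.
tempered = re.findall(r'START((?:(?!poison).)*?)FINISH', string)

# Alternative sketch: capture every block first, then drop the unwanted ones.
filtered = [m for m in re.findall(r'START(.*?)FINISH', string)
            if 'poison' not in m]

print(tempered)  # ['candy', 'sugar']
print(filtered)  # ['candy', 'sugar']
```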

Find substrings matching a pattern allowing overlaps [duplicate]

This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
Closed 3 years ago.
So I have strings that form concatenated 1's and 0's with length 12. Here are some examples:
100010011100
001111110000
001010100011
I want to isolate the sections of each string that start with 1, are followed by any number of zeros, and then end with 1.
So for the first string, I would want ['10001','1001']
For the second string, I would want nothing returned
For the third string, I would want ['101','101','10001']
I've tried using a combination of positive lookahead and positive lookbehind, but it isn't working. This is what I've come up with so far [(?<=1)0][0(?=1)]
For a non-regex approach, you can split the string on 1. The matches you want are any elements in the resulting list with a 0 in it, excluding the first and last elements of the array.
Code:
myStrings = [
    "100010011100",
    "001111110000",
    "001010100011"
]
for s in myStrings:
    matches = ["1"+z+"1" for i, z in enumerate(s.split("1")[:-1]) if (i>0) and ("0" in z)]
    print(matches)
Output:
#['10001', '1001']
#[]
#['101', '101', '10001']
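I suggest writing a simple regex: r'10+1'. Then use Python logic to find each match using the compiled pattern's search() method. After each match, start the next search at the final 1 of that match (match.end() - 1), so that the same 1 can also begin the next match.
A single non-overlapping search with this pattern cannot return the overlapping matches.
import re

def parse(s):
    pattern = re.compile(r'(10+1)')
    match = pattern.search(s)
    while match:
        yield match[0]
        # restart on the closing 1 so it can open the next match
        match = pattern.search(s, match.end() - 1)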
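A different technique that is often used for overlapping matches is to wrap the pattern in a capturing group inside a zero-width lookahead, so that the regex engine consumes no characters and retries at every position. The sketch below is not the loop from the answer above, and the function name overlapping is hypothetical:

```python
import re

def overlapping(s):
    """Return every 1, zeros, 1 run, including runs that share a 1."""
    # The lookahead consumes nothing, so the scan advances one character at
    # a time and the group captures the run that starts at each position.
    return re.findall(r'(?=(10+1))', s)

for s in ["100010011100", "001111110000", "001010100011"]:
    print(overlapping(s))
# ['10001', '1001']
# []
# ['101', '101', '10001']
```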

Grab a number from a text file using regex [duplicate]

This question already has an answer here:
Regex to match the last float in a string
(1 answer)
Closed 3 years ago.
I have a text file that has lines of the format something like:
1 12.345 12345.12345678 56.789 textextext
Using python, I want to be able to grab the number that has the format nn.nnn, but only the one in the penultimate column, i.e. for this row, I would like to grab 56.789 (and not 12.345).
I know I can do something like:
re.findall(r' \d\d\.\d\d\d',<my_line>)[0]
but I'm not sure how to make sure I only grab one of the two numbers with this same format.
You may use a greedy match before matching your number:
>>> import re
>>> s = '1 12.345 12345.12345678 56.789 textextext'
>>> print(re.findall(r'.*(\b\d+\.\d+)', s)[0])
56.789
RegEx Details:
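.* is greedy: it consumes as much of the line as possible, then backtracks until the rest of the pattern can match, so the captured number is the last float in the line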
\b is for word boundary
\d+\.\d+ matches a floating point number
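To make the effect of greediness visible, here is a small sketch that contrasts the greedy prefix with a lazy one on the same line. The lazy variant is added only for comparison and is not part of the answer above:

```python
import re

s = '1 12.345 12345.12345678 56.789 textextext'

# Greedy prefix: .* grabs the whole line, then backs off to the last float.
greedy = re.search(r'.*(\b\d+\.\d+)', s).group(1)

# Lazy prefix: .*? grows one character at a time and stops at the first float.
lazy = re.search(r'.*?(\b\d+\.\d+)', s).group(1)

print(greedy)  # 56.789
print(lazy)    # 12.345
```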
