Regular expression search all substrings between two keywords with linebreaks [duplicate] - python

This question already has answers here:
Regular expression to extract text between square brackets
(15 answers)
Closed 2 years ago.
Let's say I have a Python string:
s = "\\[this\\] is \\[some\n text\\]."
I would like a regular expression that would return me substrings "this" and "some\n text".
I've tried
re.search(r'\\[(.*)\\]',s)
but it does not work (it returns None).

You are missing one backslash in the regex. Also use re.DOTALL so that the dot . matches the newline character:
import re
s = "\\[this\\] is \\[some\n text\\]."
r = re.findall(r'\\\[(.*?)\\\]', s, flags=re.DOTALL)
print(r) # ['this', 'some\n text']
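To see what each of the two fixes contributes, here is a small sketch (the variable names are only illustrative). In the original pattern, \\ matches one backslash and [(.*)\\] is then read as a character class, so the regex looks for a backslash followed by one of ( . * ) \ and finds nothing in s.

```python
import re

s = "\\[this\\] is \\[some\n text\\]."

# Original attempt: "[(.*)\\]" is parsed as a character class, not as "[" + group.
original = re.search(r'\\[(.*)\\]', s)

# Bracket escaped, but no re.DOTALL: "." stops at the newline.
without_dotall = re.findall(r'\\\[(.*?)\\\]', s)

# With re.DOTALL, "." also matches "\n", so the second substring is found too.
with_dotall = re.findall(r'\\\[(.*?)\\\]', s, flags=re.DOTALL)

print(original)        # None
print(without_dotall)  # ['this']
print(with_dotall)     # ['this', 'some\n text']
```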

I will take the string you posted literally, but you can easily edit the regex to match another pattern.
I think that this can do the work:
'\\\\\[(.*?)\\\\\]'
Explained:
\ escapes a character, so with \\ you match one literal backslash. Since you have to find 2 backslashes, you need 2 more of them as escape characters (4 in total)
For the same reason as above, you need one more \ to escape the [ character
( sets your capturing group
. matches any character
* as many times as possible, but followed by a ? it means as few times as possible
) closes your capturing group
the other 5 \ followed by ] work as explained before (escaping the backslash/bracket sequence)
Hope I helped ;)
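A quick sketch of what this pattern does when it is written as a raw string. The raw-string form, the re.DOTALL flag (borrowed from the previous answer so that . can cross the newline) and the test string s_two are assumptions made for illustration. With four backslashes, the raw pattern requires two literal backslashes before each bracket, so it matches a string that really contains \\[ ... \\], but not the s from the question, which holds only one backslash before each bracket.

```python
import re

# s from the question: ONE backslash before each bracket.
s = "\\[this\\] is \\[some\n text\\]."
# Hypothetical variant, taken literally: TWO backslashes before each bracket.
s_two = r"\\[this\\] is \\[some" + "\n" + r" text\\]."

pattern = re.compile(r'\\\\\[(.*?)\\\\\]', re.DOTALL)

found_in_two = pattern.findall(s_two)
found_in_s = pattern.findall(s)

print(found_in_two)  # ['this', 'some\n text']
print(found_in_s)    # []
```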

You can use a negated character class ([^][]*) with a capture group, and match the \ right before the closing ] outside of the group.
import re
s = "\\[this\\] is \\[some\n text\\]."
print(re.findall(r"\[([^][]*)\\]", s))
Output
['this', 'some\n text']

Related

How to split a string with parentheses and spaces into a list

I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code that I tried is below. On the regexr site, the regular expression matches the parts that I want but gives a warning. I'm not an expert in regex, so I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))
Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']
For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', str)
This does not work, however, for the string defined in the OP's code: that string contains a question mark, which is inconsistent with the two examples given in the statement of the question.
The regular expression can be broken down as follows.
(?<=\)) # positive lookbehind asserts that location in the
# string is preceded by ')'
[ ]+ # match one or more spaces
| # or
[ ]+ # match one or more spaces
(?=\() # positive lookahead asserts that location in the
# string is followed by '('
In the above I've put each of the two space characters in a character class merely to make them visible.
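The breakdown above can be run almost verbatim with re.VERBOSE, which ignores whitespace and comments inside the pattern (a space inside a character class is still kept). This is only a sketch; the names splitter, examples and results are illustrative.

```python
import re

splitter = re.compile(r"""
    (?<=\))  # position is preceded by ')'
    [ ]+     # one or more spaces
    |        # or
    [ ]+     # one or more spaces
    (?=\()   # position is followed by '('
""", re.VERBOSE)

examples = ["(so) what (are you trying to say)", "what (do you mean)"]
results = [splitter.split(text) for text in examples]

for parts in results:
    print(parts)
# ['(so)', 'what', '(are you trying to say)']
# ['what', '(do you mean)']
```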

only keep digits after ":" in regular expression in python [duplicate]

This question already has answers here:
How to grab number after word in python
(4 answers)
Closed 2 years ago.
I want to extract the numbers for each parameter below:
import re
parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
The desired output should be: ['42602', '42401', '42101']
I first tried re.findall(r'\d+',parameters), but it also returns the "2" from "NO2" and "SO2".
Then I tried re.findall(':.*',parameters), but it returns [': 42602', ': 42401', ': 42101']
If I can not rename the "NO2" to "Nitrogen dioxide", is there a way just to collect numbers on the right (after ":")?
Many thanks.
If you do not want to use capturing groups, you could use look behind.
(?<=:\s)\d+
Details:
(?<=:\s): asserts that the match is immediately preceded by ":" and one whitespace character (these are not part of the result)
\d+: gets digits
I also tested this in Python.
import re
parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
result = re.findall(r'(?<=:\s)\d+',parameters)
print (result)
Result
['42602', '42401', '42101']
You can use the following regex to capture the numbers
^\s*\w+:\s(\d+)$
Here, ^ at the beginning asserts the position at the start of the line. \s* means that there may be 0 or more whitespace characters before the content. \w+:\s matches one or more word characters followed by ":" and a whitespace character, that is "NO2: ".
Finally, (\d+) matches the following digits you want as a group. $ matches the end of the line.
To get all the matches as a list you can use
matches = re.findall(r'^\s*\w+:\s(\d+)$', parameters, re.MULTILINE)
As re.MULTILINE is specified, "the pattern character '^' matches at the beginning of the string and at the beginning of each line", as stated in the docs.
The result is as follows:
>>> print(matches)
['42602', '42401', '42101']
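To illustrate why the flag matters, here is a small comparison (variable names are illustrative). Without re.MULTILINE, ^ only matches at the very start of the string and $ only at the very end (or just before a final newline), so the same pattern finds nothing in this multi-line string.

```python
import re

parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''

pattern = r'^\s*\w+:\s(\d+)$'

# ^ and $ anchor at every line.
multiline = re.findall(pattern, parameters, re.MULTILINE)
# ^ and $ anchor only at the start and end of the whole string.
single = re.findall(pattern, parameters)

print(multiline)  # ['42602', '42401', '42101']
print(single)     # []
```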
To put my two cents in, you could simply use
re.findall(r'(\b\d+\b)', parameters)
If you happen to have other digits floating around somewhere in your string, be more precise with
\w+:\s*(\d+)
re.findall(r'(?<=:\s)\d+', parameters)
should work. It uses a look-behind assertion, so the ":" and the whitespace are checked but not returned.
You just need to specify where in your string you want to search for digits. You can use:
re.findall(r': (\d+)', parameters)
This tells Python to look for digits in the part of the string after ":" and the "space".
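If the parameter names are needed as well, the same idea extends to two capture groups. This is a sketch that goes slightly beyond what the question asks for; pairs, codes and numbers are illustrative names. With two groups, re.findall returns a list of tuples, which dict() accepts directly.

```python
import re

parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''

# Group 1: the name before ":", group 2: the digits after it.
pairs = re.findall(r'(\w+):\s*(\d+)', parameters)
codes = dict(pairs)
numbers = [number for _, number in pairs]

print(codes)    # {'NO2': '42602', 'SO2': '42401', 'CO': '42101'}
print(numbers)  # ['42602', '42401', '42101']
```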

How to find values in specific format including parenthesis using regular expression in python [duplicate]

This question already has answers here:
Regular expression to return text between parenthesis
(11 answers)
Closed 2 years ago.
I have a long string S, and I want to find the numeric values that appear in the format "Value(**)", where ** is the value I want to extract.
For example, S is "abcdef Value(34) Value(56) Value(13)", then I want to extract values 34, 56, 13 from S.
I tried to use regex as follows.
import re
regex = re.compile('\Value(.*'))
re.findall(regex, S)
But the code yields the result I did not expect.
Edit. I edited some mistakes.
You should escape the parentheses, correct the typo of Value (as opposed to Values), use a lazy repeater *? instead of *, add the missing right parenthesis, and capture what's enclosed in the escaped parentheses with a pair of parentheses:
regex = re.compile(r'Value\((.*?)\)')
Only one of your numbers follows the word 'Value', so you can extract anything inside parentheses. You also need to escape the parentheses which are special characters.
regex = re.compile(r'\(.*?\)')
re.findall(regex, S)
Output:
['(34)', '(56)', '(13)']
I think what you're looking for is a capturing group that can return multiple matches. The pattern is: (\(\d{2}\))?. \d matches a digit and {2} matches exactly 2 digits; {1,2} would match 1 or 2 digits, etc. The trailing ? makes the group optional (it matches zero or one time). Each group can be indexed and combined into a list. This is a simple implementation and will return the two-digit numbers together with their parentheses.
E.g. 'asdasd Value(102222), fgdf(20), he(77)' will match 20 and 77 but not 102222.
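For comparison, a minimal sketch that ties the match to the literal word Value and captures only the digits. It assumes the values are plain non-negative integers; values and other are illustrative names.

```python
import re

S = "abcdef Value(34) Value(56) Value(13)"

# Escaped parentheses are literal; the inner group captures just the digits.
values = [int(v) for v in re.findall(r'Value\((\d+)\)', S)]
print(values)  # [34, 56, 13]

# Only parentheses that directly follow "Value" are considered.
other = re.findall(r'Value\((\d+)\)', 'asdasd Value(102222), fgdf(20), he(77)')
print(other)   # ['102222']
```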

re.findall bad escape for category S and P [duplicate]

This question already has answers here:
Python regex matching Unicode properties
(6 answers)
Closed 3 years ago.
I am trying to remove all punctuation and special characters from a string, including numbers, but I get an error: error: bad escape \p at position 2
Does this mean that Python's regex does not recognize \p{S} and \p{P}?
The code is:
name = "URL-dsds diasa:dksdjsk dskdjs_dskjdks 23232 dsds32 dskdjskds&dsjdsjdhs fddjfd%djshdhjs kdjs¤dskjds öfdfdjfkdj"
re.findall(r'[^\p{P}\p{S}\s\d]+', name.lower())
I expect as output the same as highlighted by regex101:
https://regex101.com/r/HJZAUU/1
Any help?
I followed @WiktorStribiżew's comment to use the PyPI regex package, as it supports Unicode category classes. So I simply did:
pip install regex
import regex as re
name = "URL-dsds diasa:dksdjsk dskdjs_dskjdks 23232 dsds32 dskdjskds&dsjdsjdhs fddjfd%djshdhjs kdjs¤dskjds öfdfdjfkdj"
re.findall(r'[^\p{P}\p{S}\s\d]+', name.lower())
I get output:
['url', 'dsds', 'diasa', 'dksdjsk', 'dskdjs', 'dskjdks', 'dsds',
'dskdjskds', 'dsjdsjdhs', 'fddjfd', 'djshdhjs', 'kdjs', 'dskjds',
'öfdfdjfkdj']
Yes, unfortunately so.
Check out regex101.com
Change the flavor to Python and paste your regex in the field at the top. This gives you the following explanation on the right:
[^\p{P}\p{S}\s\d]+
Match a single character not present in the list below [^\p{P}\p{S}\s\d]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\p matches the character p literally (case sensitive) <-- note
{P} matches a single character in the list {P} (case sensitive) <-- note
\p matches the character p literally (case sensitive)
{S} matches a single character in the list {S} (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\d matches a digit (equal to [0-9])
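If installing a third-party package is not an option, a rough standard-library approximation is possible with unicodedata.category, whose return value starts with "P" for punctuation and "S" for symbols. This is only a sketch under that assumption, not a drop-in replacement for \p{...}; the helper name split_on_punct_symbols is made up for illustration.

```python
import unicodedata


def split_on_punct_symbols(text):
    """Return runs of characters that are not punctuation (P*), symbols (S*),
    whitespace or decimal digits (Nd)."""
    tokens, current = [], []
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in "PS" or category == "Nd" or ch.isspace():
            # Separator: close the current run, if any.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


name = "URL-dsds diasa:dksdjsk dskdjs_dskjdks 23232 dsds32 dskdjskds&dsjdsjdhs fddjfd%djshdhjs kdjs¤dskjds öfdfdjfkdj"
tokens = split_on_punct_symbols(name.lower())
print(tokens)
```

For the string from the question this yields the same tokens as the regex-module answer above.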

regular expression findall errors [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 4 years ago.
I run the following script
a = r'[abc] [abc] [y78]'
paaa = re.compile(r'\[ab.*]')
paaa.findall(a)
I obtained
['[abc] [abc] [y78]']
Why is '[abc]' missing? '[abc]' clearly matches the pattern as well. Is there a bug in the Python 3 re.findall function?
Clarification:
Sorry, paaa should be paaa = re.compile(r'\[ab.*\]')
What I am looking for is something which will return
['[abc]', '[abc]', '[abc] [abc]', '[abc] [abc] [y78]']
Basically, any substring that matches the pattern.
The repeated . in \[ab.*] is greedy: it will match as many characters as it can, as long as those characters are followed by a ]. So everything between the first [ and the last ] is matched.
Use lazy repetition instead, with .*?:
a = r'[abc] [abc] [y78]'
paaa = re.compile(r'\[ab.*?]')
print(paaa.findall(a))
['[abc]', '[abc]']
You should escape the right square bracket as well, and use non-greedy repeater *? in your regex:
import re
a = r'[abc] [abc] [y78]'
paaa = re.compile(r'\[ab.*?\]')
print(paaa.findall(a))
This outputs:
['[abc]', '[abc]']
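Note that re.findall only returns non-overlapping matches, so neither the greedy nor the lazy version can produce the list from the clarification. If every substring that matches the pattern is really wanted, one brute-force sketch is to test each slice with fullmatch. This is quadratic in the length of the string, and all_matches is just an illustrative name.

```python
import re

a = r'[abc] [abc] [y78]'
pattern = re.compile(r'\[ab.*\]')

# Try every slice a[i:j] and keep those the pattern matches completely.
all_matches = [
    a[i:j]
    for i in range(len(a))
    for j in range(i + 1, len(a) + 1)
    if pattern.fullmatch(a[i:j])
]
print(all_matches)
# ['[abc]', '[abc] [abc]', '[abc] [abc] [y78]', '[abc]', '[abc] [y78]']
```

Besides the four substrings listed in the clarification, this also finds '[abc] [y78]', which starts at the second '[abc]' and matches the pattern too.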
