Or condition for literal string in regex expression - python

I have the following regex expression
re.findall('\(([0-9].*?)\)', a[a.find('('):].strip())
defined for strings like
asdasdasd (21345-asdasdasd)
to retrieve what is inside the parentheses when it starts with a number. But I also want to be able to retrieve the content when it starts with the string 'NA', like:
asdasdasd (NA-asdasdasd)
I've tried:
re.findall('\(([0-9].*?)\)|\((NA.*?)\)', a[a.find('('):].strip())
but it produces tuples. How should I write it? Thank you in advance!

You may capture the substring between parentheses when the text inside starts with digits or NA followed by - and then any chars other than ) using
re.findall(r'\(((?:[0-9]+|NA)-[^)]*)\)', a)
See the regex demo.
Details
\( - a (
((?:[0-9]+|NA)-[^)]*) - Capturing group (this value will be returned by re.findall):
(?:[0-9]+|NA) - 1 or more digits or NA
- - a hyphen
[^)]* - 0+ chars other than )
\) - a ) char.
See the Python demo:
import re
strs = ['asdasdasd (21345-asdasdasd)', 'asdasdasd (NA-asdasdasd)']
for s in strs:
    print(re.findall(r'\(((?:[0-9]+|NA)-[^)]*)\)', s))
Output:
['21345-asdasdasd']
['NA-asdasdasd']
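To see where the tuples in the question come from, here is a minimal sketch comparing the two patterns on one of the sample strings. The variable names are only illustrative. With more than one capturing group, re.findall returns one tuple per match and fills groups that did not participate with an empty string; with a single capturing group it returns plain strings.

```python
import re

s = "asdasdasd (NA-asdasdasd)"

# Pattern from the question: two capturing groups, so each match is a tuple.
two_groups = re.findall(r'\(([0-9].*?)\)|\((NA.*?)\)', s)
print(two_groups)  # [('', 'NA-asdasdasd')]

# Pattern from the answer: the alternation sits in a non-capturing group,
# leaving one capturing group, so each match is a plain string.
one_group = re.findall(r'\(((?:[0-9]+|NA)-[^)]*)\)', s)
print(one_group)  # ['NA-asdasdasd']
```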

Related

Regex to pull the first and last letter of a string

I am using \d{2}-\d{2}-\d{4} to validate my string. It pulls the sequence of numbers out of the string, resulting in 02-01-1716. However, I also need to pull the letters the string begins and ends with; i.e., for Q:\Region01s\FY 02\02-01-1716A.pdf I need the Q as well as the A, so in the end I would have Q: 02-01-1716A
You can use
import re
regex = r"^([a-zA-Z]:)\\(?:.*\\)?(\d{2}-\d{2}-\d{4}[a-zA-Z]?)"
text = r"Q:\Region01s\FY 02\02-01-1716A.pdf"
match = re.search(regex, text)
if match:
    print(f"{match.group(1)} {match.group(2)}")
# => Q: 02-01-1716A
See the Python demo. Also, see the regex demo. Details:
^ - start of string
([a-zA-Z]:) - Group 1: a letter and :
\\ - a backslash
(?:.*\\)? - an optional sequence of any chars other than line break chars as many as possible, followed with a backslash
(\d{2}-\d{2}-\d{4}[a-zA-Z]?) - Group 2: two digits, -, two digits, -, four digits, an optional letter.
The output - if there is a match - is a concatenation of Group 1, space and Group 2 values.
You can try:
(.).*(.)\.[^\.]+$
Or with the validation:
(.).*\d{2}-\d{2}-\d{4}(.)\.[^\.]+$
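A quick check of the validating pattern against the path from the question might look like the sketch below. The second pattern is a hypothetical variant, not part of the answer above: it widens the last group so the number and the trailing letter are captured together, and it prints the first letter with a colon after it.

```python
import re

text = r"Q:\Region01s\FY 02\02-01-1716A.pdf"

# Pattern from the answer: Group 1 is the first character,
# Group 2 is the character right before the file extension.
m = re.search(r'(.).*\d{2}-\d{2}-\d{4}(.)\.[^\.]+$', text)
first_last = (m.group(1), m.group(2)) if m else None
print(first_last)  # ('Q', 'A')

# Hypothetical variant: capture the number together with the trailing letter.
m2 = re.search(r'^(.).*(\d{2}-\d{2}-\d{4}.)\.[^.]+$', text)
combined = f"{m2.group(1)}: {m2.group(2)}" if m2 else None
print(combined)  # Q: 02-01-1716A
```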

Python Regex: To capture all words within nested parentheses

I am trying to extract all words within nested parentheses by using regex. Here is an example of my .txt file:
hello ((
(alpha123_4rf)
45beta_Frank))
Red5Great_Sam_Fun
I have tried this with regex:
r'[\((?\(??(^\()?\))]'
but have not been able to get the desired output. I want my output to be like this:
((
(alpha123_4rf)
45beta_Frank))
What am I doing wrong? Any help is greatly appreciated!
Try this pattern (?s)\([^(]*\((.+)\)[^)]*\)
Explanation:
(?s) - flag: single line mode - . matches also newline character
\( - match ( literally
[^(]* - match zero or more characters other than (
\( - match ( literally
(.+) - match one or more of any characters and store them in the first capturing group
\) - match ) literally
[^)]* - match zero or more characters other than )
\) - match ) literally
If the parentheses directly follow each other, this simpler solution would also do it:
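import re
def find_brackets(text):
    # (?s) lets . match newlines, so the match can span several lines
    rx = r"(?s)\(\((.+)\)\)"
    z = re.search(rx, text)
    if z:
        # z[0] is the whole match, including the outer parentheses
        return z[0]
    else:
        return ''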
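The first pattern from this answer can be tried on the sample text from the question roughly as follows. This is only a sketch: the sample is typed in as a string literal here rather than read from a .txt file, and re.DOTALL is passed as a flag in place of the inline (?s).

```python
import re

# Sample text from the question, with explicit newlines.
text = "hello ((\n(alpha123_4rf)\n45beta_Frank))\nRed5Great_Sam_Fun"

# Same pattern as above; re.DOTALL is equivalent to the inline (?s) flag.
pattern = re.compile(r'\([^(]*\((.+)\)[^)]*\)', re.DOTALL)

m = pattern.search(text)
whole = m.group(0) if m else ''   # the full "((...))" span
inner = m.group(1) if m else ''   # only what Group 1 captured
print(whole)
print(repr(inner))
```

Note that the desired output in the question corresponds to the whole match (group 0), not to Group 1.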

python regular expression : how to remove all punctuation characters from a string but keep those between numbers?

I am working on a Chinese NLP project. I need to remove all punctuation characters except those between numbers, keeping only Chinese characters (\u4e00-\u9fff) and alphanumeric characters (0-9a-zA-Z). For example, the hyphen in 12-34 should be kept while the equals sign after 123 should be removed.
Here is my python script.
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)
the expected output should be
中国中国foo中国bar中123国中国12-34中国
but the result is
中国中国foo中国bar中123=国中国12-34中国
I can't figure out why there is an extra equal sign in the output?
Your regex removes a run of punctuation only when both the lookbehind and the lookahead succeed, i.e. when there is a non-digit on both sides of the run. For "=", the lookbehind (?<=[^0-9]) fails because the preceding character is 3, so the "=" is kept. This means your code actually keeps any non-alphanumeric, non-Chinese characters that have a digit on either side.
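A small sketch of this behaviour, using the regex from the question on two short strings made up for illustration:

```python
import re

punct = '[^\u4e00-\u9fff0-9a-zA-Z]+'
# Regex from the question: punctuation with a non-digit on both sides.
original = re.compile('(?<=[^0-9])' + punct + '(?=[^0-9])')

# '=' is preceded by the digit 3, so the lookbehind fails and '=' stays.
after_digit = original.sub('', '中123=国')
print(after_digit)  # 中123=国

# '=' is preceded by a letter and followed by a non-digit, so it is removed.
after_letter = original.sub('', '中a=国')
print(after_letter)  # 中a国
```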
You can try the following regex:
u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'
You can use it as such:
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
print(''.join(res))
I suggest matching and capturing these characters in between digits (to restore them later in the output), and just match them in other contexts.
In Python 2, it will look like
import re
s = u"中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
pat_block = u'[^\u4e00-\u9fff0-9a-zA-Z]+'
pattern = u'([0-9]+{0}[0-9]+)|{0}'.format(pat_block)
res = re.sub(pattern, lambda x: x.group(1) if x.group(1) else u"" ,s)
print(res.encode("utf8")) # => 中国中国foo中国bar中123国中国12-34中国
See the Python demo
If you need to preserve those symbols inside any Unicode digits, you need to replace [0-9] with \d and pass the re.UNICODE flag to the regex.
The regex will look like
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+)|[^\u4e00-\u9fff0-9a-zA-Z]+
It works like this:
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+) - Group 1 capturing
[0-9]+ - 1+ digits
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
[0-9]+ - 1+ digits
| - or
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
In Python 2.x, when a group is not matched in re.sub, the backreference to it is None, that is why a lambda expression is required to check if Group 1 matched first.
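For Python 3, the same capture-and-restore idea could be sketched as follows. This is an adaptation rather than the code from the answer: the u prefixes and the encode call are dropped, and `m.group(1) or ''` stands in for the explicit check in the lambda.

```python
import re

s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"

# Runs of characters that are neither Chinese nor ASCII alphanumeric.
pat_block = '[^\u4e00-\u9fff0-9a-zA-Z]+'
# Group 1 keeps such a run only when it sits between digits;
# the second alternative matches it anywhere else.
pattern = re.compile('([0-9]+{0}[0-9]+)|{0}'.format(pat_block))

# Restore Group 1 when it matched, otherwise replace with nothing.
res = pattern.sub(lambda m: m.group(1) or '', s)
print(res)  # 中国中国foo中国bar中123国中国12-34中国
```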

regex of symbolic expression grouped

In Python, I am trying to write a regex for an expression like this:
function_1(param_1,param_2,param_3)+function_2(param_4,param_5)*function_3(param_6)+function_4()-function_5(param_7,param_8,param_9,param_10)
I am using this regex
(?P<perf_name>\w*?)\((?P<perf_param>[\w]+)*(?:,*(?P<perf_param2>[\w]+)?)*\)
but I'm stuck because so far I can't get the params that are not next to a bracket (param_2, param_8 and param_9).
Plus, I am pretty sure there is some solution that would let me use a single perf_param group instead of both perf_param and perf_param2.
Any ideas?
You should do that in 2 steps:
(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)
This regex will get you the name and params as two groups. Then, just split the second group with ,.
import re
p = re.compile(r'(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)')
s = "function_1(param_1,param_2,param_3)+function_2(param_4,param_5)*function_3(param_6)+function_4()-function_5(param_7,param_8,param_9,param_10)"
res = [(x.group("perf_name"), x.group("perf_params").split(",")) for x in p.finditer(s)]
print(res)
# => [('function_1', ['param_1', 'param_2', 'param_3']), ('function_2', ['param_4', 'param_5']), ('function_3', ['param_6']), ('function_4', ['']), ('function_5', ['param_7', 'param_8', 'param_9', 'param_10'])]
See the Python demo
The regex matches:
(?P<perf_name>\w*) - 0 or more alphanumeric/underscore characters
\( - a literal (
(?P<perf_params>\w*(?:,\w+)*) - 0+ word characters (\w*) followed with 0+ sequences of a comma and 1+ word characters
\) - closing ).
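One detail visible in the output above: for function_4() the split yields [''] rather than an empty list. If an empty list is preferred, a small variation (an assumption about what the caller wants, not part of the answer) can filter out empty strings after splitting. The input string is shortened here for brevity.

```python
import re

p = re.compile(r'(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)')
s = "function_1(param_1,param_2)+function_4()"

# Drop empty strings so that a call without arguments gives [].
res = [
    (m.group("perf_name"),
     [arg for arg in m.group("perf_params").split(",") if arg])
    for m in p.finditer(s)
]
print(res)  # [('function_1', ['param_1', 'param_2']), ('function_4', [])]
```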

Match first parenthesis with Python

From a string such as
70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30
I want to get the first parenthesized content linux;u;android4.2.1;zh-cn.
My code looks like this:
s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
re.search("(\d+)\s.+\((\S+)\)", s).group(2)
but the result is the last brackets' contents khtml,likegecko.
How to solve this?
The main issue you have is the greedy dot matching .+ pattern. It grabs the whole string you have, and then backtracks, yielding one character from the right at a time, trying to accommodate for the subsequent patterns. Thus, it matches the last parentheses.
You can use
^(\d+)\s[^(]+\(([^()]+)\)
See the regex demo. Here, the [^(]+ restricts the matching to characters other than (, so it cannot grab the whole line up to the end and stops at the first pair of parentheses.
Pattern explanation:
^ - string start (NOTE: If the number appears not at the start of the string, remove this ^ anchor)
(\d+) - Group 1: 1 or more digits
\s - a whitespace (if it is not a required character, it can be removed since the subsequent negated character class will match the space)
[^(]+ - 1+ characters other than (
\( - a literal (
([^()]+) - Group 2 matching 1+ characters other than ( and )
\) - closing ).
Here is the IDEONE demo:
import re
p = re.compile(r'^(\d+)\s[^(]+\(([^()]+)\)')
test_str = "70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30"
print(p.findall(test_str))
# or using re.search if the number is not at the beginning of the string
m = re.search(r'(\d+)\s[^(]+\(([^()]+)\)', test_str)
if m:
    print("Number: {0}\nString: {1}".format(m.group(1), m.group(2)))
# [('70849', 'linux;u;android4.2.1;zh-cn')]
# Number: 70849
# String: linux;u;android4.2.1;zh-cn
You can use a negated character class \(([^)]*)\) to match anything between ( and ):
>>> import re
>>> s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
>>> m = re.search(r"(\d+)[^(]*\(([^)]*)\)", s)
>>> print(m.group(1))
70849
>>> print(m.group(2))
linux;u;android4.2.1;zh-cn
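The difference between the greedy pattern from the question and the negated character class can be seen side by side in a short sketch like this one (the variable names are only illustrative):

```python
import re

s = ("70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30"
     "(khtml,likegecko)version/4.0mobilesafari/534.30")

# Greedy .+ runs to the end of the string and backtracks,
# so it settles on the last pair of parentheses.
greedy = re.search(r"(\d+)\s.+\((\S+)\)", s).group(2)
print(greedy)  # khtml,likegecko

# [^(]+ cannot cross a "(", so the match stops at the first pair.
negated = re.search(r"(\d+)\s[^(]+\(([^()]+)\)", s).group(2)
print(negated)  # linux;u;android4.2.1;zh-cn
```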
