Python Split Regex not split what I need - python

I have this in my file
import re
sample = """Name: #s
Owner: #a[tag=Admin]"""
target = r"#[sae](\[[\w{}=, ]*\])?"
regex = re.split(target, sample)
print(regex)
I want to split all words that start with #, so like this:
["Name: ", "#s", "\nOwner: ", "#a[tag=Admin]"]
But instead it give this:
['Name: ', None, '\nOwner: ', '[tag=Admin]', '']
How to seperating it?

I would use re.findall here:
sample = """Name: #s
Owner: #a[tag=Admin]"""
parts = re.findall(r'#\w+(?:\[.*?\])?|\s*\S+\s*', sample)
print(parts) # ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']
The regex pattern used here says to match:
#\w+ a tag #some_tag
(?:\[.*?\])? followed by an optional [...] term
| OR
\s*\S+\s* any other non whitespace term,
including optional whitespace on both sides

If I understand the requirements correctly you could do that as follows:
import re
s = """Name: #s
Owner: #a[tag=Admin]
"""
rgx = r'(?=#.*)|(?=\r?\n[^#\r\n]*)'
re.split(rgx, s)
#=> ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]\n']
Demo
The regular expression can be broken down as follows.
(?= # begin a positive lookahead
#.* # match '#' followed by >= 0 chars other than line terminators
) # end positive lookahead
| # or
(?= # begin a positive lookahead
\r?\n # match a line terminator
[^#\r\n]* # match >= 0 characters other than '#' and line terminators
) # end positive lookahead
Notice that matches are zero-width.

re.split expects the regular expression to match the delimiters in the string. It only returns the parts of the delimiters which are captured. In the case of your regex, that's only the part between the brackets, if present.
If you want the whole delimiter to show up in the list, put parentheses around the whole regex:
target = r"(#[sae](\[[\w{}=, ]*\])?)"
But you are probably better off not capturing the interior group. You can change it to a non-capturing group by using (?:…) instead of (…):
target = r"(#[sae](?:\[[\w{}=, ]*\])?)"

In your output, you keep the [tag=Admin] as that part is in a capture group, and using split can also return empty strings.
Another option is to be specific about the allowed data format, and instead of split capture the parts in 2 groups.
(\s*\w+:\s*)(#[sae](?:\[[\w{}=, ]*])?)
The pattern matches:
( Capture group 1
\s*\w+:\s* Match 1+ word characters and : between optional whitespace chars
) Close group
( Capture group 2
#[sae] Match # followed by either s a e
(?:\[[\w{}=, ]*])? Optionally match [...]
) Close group
Example code:
import re
sample = """Name: #s
Owner: #a[tag=Admin]"""
target = r"(\s*\w+:\s*)(#[sae](?:\[[\w{}=, ]*])?)"
listOfTuples = re.findall(target, sample)
lst = [s for tpl in listOfTuples for s in tpl]
print(lst)
Output
['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']
See a regex demo and a Python demo.

Related

Python regex US phone numbers - need to split '6502221111' into '650' '222' '1111' [duplicate]

I have a regex that parses US phone numbers into 3 strings.
import re
s = ' 916-2221111 ' # this also works'(916) 222-1111 '
reg_ph = re.match(r'^\s*\(?(\d{3})\)?-? *(\d{3})-? *-?(\d{4})', s)
if reg_ph:
return reg_ph.groups()
else:
raise ValueError ('not a valid phone number')
it works perfectly on the numbers:
'(916) 222-1111 '
' 916-2221111 '
Now I need to add an additional regex to generate a Value Error for numbers such as
s = '916 111-2222' # there are white spaces between the area code and a local number and NO ')'
I tried
reg_ph = re.match(r'^\s*\(?(\d{3})\)?\s*-? *(\d{3})-? *-?(\d{4})', s)
reg_ph = re.match(r'^\s*\(?(\d{3})\)?s*-? *(\d{3})-? *-?(\d{4})', s)
but non rejects the string in question
I will greatly appreciate any ideas. I am very new to Regex!
In Python re you could use a conditional to check for group 1 having the opening parenthesis.
If that is the case match the closing parenthesis, optional spaces and 3 digits. Else match - and 3 digits.
If you use re.match you can omit ^
^\s*(\()?\d+(?(1)\)\s*\d{3}|-\d{3})-?\d{4}
If you want to match the whole string and trailing whitespace chars:
^\s*(\()?\d+(?(1)\)\s*\d{3}|-\d{3})-?\d{4}\s*$
In parts, the pattern matches:
^ Start of string
\s* Match optional whitespace chars
(\()? Optional group 1, match (
\d+ Match 1+ digits
(? Conditional
(1)\)\s*\d{3} If group 1 exist, match the closing ), optional whitespace chars and 3 digits
| Or
-? Match optional -
\d{3} Match 3 digits
) close conditional
-?\d{4} Match optional - and 4 digits
See a regex demo
For example, using capture groups in the pattern to get the digits:
import re
strings = [' (916) 111-2222',' 916-2221111 ', '916 111-2222']
pattern =r'\s*(\()?(\d+)(?(1)\)\s*(\d{3})|-(\d{3}))-?(\d{4})\s*$'
for item in strings:
m=re.match(pattern, item)
if m:
t = tuple(s for s in m.groups() if s is not None and s.isdigit())
print(t)
else:
print("no match for " + item)
Output
('916', '111', '2222')
('916', '222', '1111')
no match for 916 111-2222
Python demo

Splitting a string by multiple possible delimiters

I want to parse a str into a list of float values, however I want to be flexible regarding my delimiters. Specifically, I would like to be able to use any of these
s = '3.14; 42.2' # delimiter is '; '
s = '3.14;42.2' # delimiter is ';'
s = '3.14, 42.2' # delimiter is ', '
s = '3.14,42.2' # delimiter is ','
s = '3.14 42.2' # delimiter is ' '
I thought about removing all spaces, but this would disable the last version; I tried the re.split()-function by doing re.split('[;, ]', s) which would work using a single character as delimiter but fails otherwise.
I can however do
s.replace('; ', ';').replace(', ', ';').replace(',', ';').replace(' ', ';')
s.split(';')
which works but seems not really like a good practice or useful - especially if I would add even more delimiters in the future. What would be a good approach to do this?
You can use re.split and split on (The [ ] is a space and the brackets are for display only)
[;,] ?|[ ]
The pattern matches
[;,] ? Match either ; or , followed by an optional space
| or
[ ] Match a single space
Regex demo | Python demo
A bit more strict pattern with lookarounds could be asserting a digit on the left using lookarounds.
(?<=\d)(?:[;,] ?| )(?=\d)
The pattern matches:
(?<=\d) Positive lookbehind, assert a digit to the left
(?: Non capture group for the alternation
[;,] ? Match either ; or , followed by an optional space
| Or
Match a space
) Close non capture group
(?=\d) Positive lookahead, assert a digit to the right
Regex demo
Example code
import re
strings = [
"3.14; 42.2",
"3.14;42.2",
"3.14, 42.2",
"3.14,42.2",
"3.14 42.2"
]
for s in strings:
print(re.split(r"[;,] ?| ", s))
Output
['3.14', '42.2']
['3.14', '42.2']
['3.14', '42.2']
['3.14', '42.2']
['3.14', '42.2']
I think you can account for the last space(s) like this:
re.split(r'[;,]\s*', s)
Here \s* will capture the spaces after the separator, if any.
can also just do:
res = re.split('; |;|,|, | ', data)
see https://www.geeksforgeeks.org/python-split-multiple-characters-from-string/
Assuming you would know the delimiter of the input ahead of time, you could write a function that takes your delimiter as an argument, replaces with a space, and splits it:
def split_on_delim(strng, delim):
return strng.replace(delim, ' ').split()
for example:
>>> s = '3.14; 42.2'
>>> split_on_delim(s, '; ')
['3.14', '42.2']

Regex pattern to find n non-space characters of x length after a certain substring

I am using this regex pattern pattern = r'cig[\s:.]*(\w{10})' to extract the 10 characters after the '''cig''' contained in each line of my dataframe. With this pattern I am accounting for all cases, except for the ones where that substring contains some spaces inside it.
For example, I am trying to extract Z9F27D2198 from the string
/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031
In the previous string, it seems like Stack overflow formatted it, but there should be 17 whitespaces between F and 2, after CIG.
Could you help me to edit the regex pattern in order to account for the white spaces in that 10-characters substring? I am also using flags=re.I to ignore the case of the strings in my re.findall calls.
To give an example string for which this pattern works:
CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E
and it outputs what I want: 7826328A2B.
Thanks in advance.
You can use
r'(?i)cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
See the regex demo. Details:
cig - a cig string
[\s:.]* - zero or more whitespaces, : or .
(\S(?:\s*\S){9}) - Group 1: a non-whitespace char and then nine occurrences of zero or more whitespaces followed with a non-whitespace char
(?!\S) - immediately to the right, there must be a whitespace or end of string.
In Python, you can use
import re
text = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031"
pattern = r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
matches = re.finditer(pattern, text, re.I)
for match in matches:
print(re.sub(r'\s+', '', match.group(1)), ' found at ', match.span(1))
# => Z9F27D2198 found at (32, 57)
See the Python demo.
What about:
# removes all white spaces with replace()
x = 'CIG7826328A2B FORNITURA ENERGIA ELETTRICA U'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = '7826328A2B'
x = '/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031'.replace(' ', '')
x.split("CIG")[1][:10]
# x = '7826328A2B'
Works fine if there is only one "CIG" in the string

Regex : matching integers inside of brackets

I am trying to take off bracketed ends of strings such as version = 10.9.8[35]. I am trying to substitute the integer within brackets pattern
(so all of [35], including brackets) with an empty string using the regex [\[+0-9*\]+] but this also matches with numbers not surrounded by brackets. Am I not using the + quantifier properly?
You could match the format of the number and then match one or more digits between square brackets.
In the replacement using the first capturing group r'\1'
\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]
\b Word boundary
( Capture group 1
[0-9]+ Match 1+ digits
(?:\.[0-9]+)+ Match a . and 1+ digits and repeat that 1 or more times
) Close group
\[[0-9]+\] Match 1+ digits between square brackets
Regex demo
For example
import re
regex = r"\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]"
test_str = "version = 10.9.8[35]"
result = re.sub(regex, r'\1', test_str)
print (result)
Output
version = 10.9.8
No need for regex
s = '10.9.8[35]'
t = s[:s.rfind("[")]
print(t)
But if you insist ;-)
import re
s = '10.9.8[35]'
t = re.sub(r"^(.*?)[[]\d+[]]$", r"\1", s)
print(t)
Breakdown of regex:
^ - begins with
() - Capture Group 1 you want to keep
.*? - Any number of chars (non-greedy)
[[] - an opening [
\d+ 1+ digit
[]] - closing ]
$ - ends with
\1 - capture group 1 - used in replace part of regex replace. The bit you want to keep.
Output in both cases:
10.9.8
Use regex101.com to familiarise yourself more. If you click on any of the regex samples at bottom right of the website, it will give you more info. You can also use it to generate regex code in a variety of languages too. (not good for Java though!).
There's also a great series of Python regex videos on Youtube by PyMoondra.
A simpler regex solution:
import re
pattern = re.compile(r'\[\d+\]$')
s = '10.9.8[35]'
r = pattern.sub('', s)
print(r) # 10.9.8
The pattern matches square brackets at the end of a string with one or more number inside. The sub then replaces the square brackets and number with an empty string.
If you wanted to use the number in the square brackets just change the sub expression such as:
import re
pattern = re.compile(r'\[(\d+)\]$')
s = '10.9.8[35]'
r = pattern.sub(r'.\1', s)
print(r) # 10.9.8.35
Alternatively as said by the other answer you can just find it and splice to get rid of it.

regular expression to split

I try to understand the regex in python. How can i split the following sentence with regular expression?
"familyname, Givenname A.15.10"
this is like the phonebook in python regex http://docs.python.org/library/re.html. The person maybe have 2 or more familynames and 2 or more givennames. After the familynames exist ', ' and after givennames exist ''. the last one is the office of the person. What i did until know is
import re
file=open('file.txt','r')
data=file.readlines()
for i in range(90):
person=re.split('[,\.]',data[i],maxsplit=2)
print(person)
it gives me a result like this
['Wegner', ' Sven Ake G', '15.10\n']
i want to have something like
['Wegner', ' Sven Ake', 'G', '15', '10']. any idea?
In the regex world it's often easier to "match" rather than "split". When you're "matching" you tell the RE engine directly what kinds of substrings you're looking for, instead of concentrating on separating characters. The requirements in your question are a bit unclear, but let's assume that
"surname" is everything before the first comma
"name" is everything before the "office"
"office" consists of non-space characters at the end of the string
This translates to regex language like this:
rr = r"""
^ # begin
([^,]+) # match everything but a comma
(.+?) # match everything, until next match occurs
(\S+) # non-space characters
$ # end
"""
Testing:
import re
rr = re.compile(rr, re.VERBOSE)
print rr.findall("de Batz de Castelmore d'Artagnan, Charles Ogier W.12.345")
# [("de Batz de Castelmore d'Artagnan", ', Charles Ogier ', 'W.12.345')]
Update:
rr = r"""
^ # begin
([^,]+) # match everything but a comma
[,\s]+ # a comma and spaces
(.+?) # match everything until the next match
\s* # spaces
([A-Z]) # an uppercase letter
\. # a dot
(\d+) # some digits
\. # a dot
(\d+) # some digits
\s* # maybe some spaces or newlines
$ # end
"""
import re
rr = re.compile(rr, re.VERBOSE)
s = 'Wegner, Sven Ake G.15.10\n'
print rr.findall(s)
# [('Wegner', 'Sven Ake', 'G', '15', '10')]
What you want to do is first split the family name by ,
familyname, rest = text.split(',', 1)
Then you want to split the office with the first space from the right.
givenname, office = rest.rsplit(' ', 1)
Assuming that family names don't have a comma, you can take them easily. Given names are sensible to dots. For example:
Harney, PJ A.15.10
Harvey, P.J. A.15.10
This means that you should probably trim the rest of the record (family names are out) by a mask at the end (regex "maskpattern$").

Categories

Resources