regular expression to split - python

I am trying to understand regular expressions in Python. How can I split the following string with a regular expression?
"familyname, Givenname A.15.10"
This is like the phonebook example in the Python regex documentation, http://docs.python.org/library/re.html. A person may have two or more family names and two or more given names. The family names are followed by ', ' and the given names by a space. The last item is the person's office. What I have so far is:
import re
file = open('file.txt', 'r')
data = file.readlines()
for i in range(90):
    # split on the first two separators (a comma or a dot)
    person = re.split(r'[,\.]', data[i], maxsplit=2)
    print(person)
It gives me a result like this:
['Wegner', ' Sven Ake G', '15.10\n']
I want to have something like:
['Wegner', ' Sven Ake', 'G', '15', '10']
Any ideas?

In the regex world it's often easier to "match" rather than "split". When you're "matching" you tell the RE engine directly what kinds of substrings you're looking for, instead of concentrating on separating characters. The requirements in your question are a bit unclear, but let's assume that
"surname" is everything before the first comma
"name" is everything before the "office"
"office" consists of non-space characters at the end of the string
This translates to regex language like this:
rr = r"""
^ # begin
([^,]+) # match everything but a comma
(.+?) # match everything, until next match occurs
(\S+) # non-space characters
$ # end
"""
Testing:
import re
rr = re.compile(rr, re.VERBOSE)
print(rr.findall("de Batz de Castelmore d'Artagnan, Charles Ogier W.12.345"))
# [("de Batz de Castelmore d'Artagnan", ', Charles Ogier ', 'W.12.345')]
Update:
rr = r"""
^ # begin
([^,]+) # match everything but a comma
[,\s]+ # a comma and spaces
(.+?) # match everything until the next match
\s* # spaces
([A-Z]) # an uppercase letter
\. # a dot
(\d+) # some digits
\. # a dot
(\d+) # some digits
\s* # maybe some spaces or newlines
$ # end
"""
import re
rr = re.compile(rr, re.VERBOSE)
s = 'Wegner, Sven Ake G.15.10\n'
print(rr.findall(s))
# [('Wegner', 'Sven Ake', 'G', '15', '10')]

What you want to do is first split the family name by ,
familyname, rest = text.split(',', 1)
Then you want to split the office with the first space from the right.
givenname, office = rest.rsplit(' ', 1)
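Put together, the two steps above might look like the sketch below. The function name split_entry is made up for illustration, and the strip() calls are an addition that assumes lines may carry a trailing newline and a space after the comma, as in the question's data.

```python
def split_entry(line):
    """Split 'Family, Given Names Office' into its three parts."""
    # Everything before the first comma is the family name.
    familyname, rest = line.strip().split(',', 1)
    # The office is whatever follows the last space.
    givenname, office = rest.strip().rsplit(' ', 1)
    return familyname, givenname, office


entry = split_entry("Wegner, Sven Ake G.15.10\n")
print(entry)                # ('Wegner', 'Sven Ake', 'G.15.10')
print(entry[2].split('.'))  # ['G', '15', '10']
```

Splitting the office on '.' afterwards gives the individual fields the question asks for.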

Assuming that family names never contain a comma, they are easy to extract. Given names, however, may contain dots. For example:
Harney, PJ A.15.10
Harvey, P.J. A.15.10
This means that, once the family name is removed, you should probably match the office in the rest of the record with a pattern anchored at the end of the string (a regex of the form "maskpattern$").
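A minimal sketch of the end-anchored idea, assuming the office always has the form "letter.digits.digits" seen in the examples. OFFICE_AT_END and parse_record are illustrative names, not part of the original answer.

```python
import re

# Assumed office format: one uppercase letter, then two dot-separated numbers.
OFFICE_AT_END = re.compile(r"\s+([A-Z])\.(\d+)\.(\d+)\s*$")


def parse_record(line):
    """Return (family, given, letter, number, number), or None if no office is found."""
    m = OFFICE_AT_END.search(line)
    if m is None:
        return None
    # Everything before the office is "family, given"; dots in given names are untouched.
    familyname, _, givenname = line[:m.start()].partition(",")
    return (familyname.strip(), givenname.strip()) + m.groups()


print(parse_record("Harney, PJ A.15.10"))    # ('Harney', 'PJ', 'A', '15', '10')
print(parse_record("Harvey, P.J. A.15.10"))  # ('Harvey', 'P.J.', 'A', '15', '10')
```

Because the office is matched from the end of the line, dots inside the given names do not interfere.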

Related

Python Split Regex not split what I need

I have this in my file
import re
sample = """Name: #s
Owner: #a[tag=Admin]"""
target = r"#[sae](\[[\w{}=, ]*\])?"
regex = re.split(target, sample)
print(regex)
I want to split out all the words that start with #, like this:
["Name: ", "#s", "\nOwner: ", "#a[tag=Admin]"]
But instead it gives this:
['Name: ', None, '\nOwner: ', '[tag=Admin]', '']
How do I separate them?
I would use re.findall here:
sample = """Name: #s
Owner: #a[tag=Admin]"""
parts = re.findall(r'#\w+(?:\[.*?\])?|\s*\S+\s*', sample)
print(parts) # ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']
The regex pattern used here says to match:
#\w+ a tag #some_tag
(?:\[.*?\])? followed by an optional [...] term
| OR
\s*\S+\s* any other non whitespace term,
including optional whitespace on both sides
If I understand the requirements correctly you could do that as follows:
import re
s = """Name: #s
Owner: #a[tag=Admin]
"""
rgx = r'(?=#.*)|(?=\r?\n[^#\r\n]*)'
re.split(rgx, s) # splitting on zero-width matches requires Python 3.7+
#=> ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]', '\n']
The regular expression can be broken down as follows.
(?= # begin a positive lookahead
#.* # match '#' followed by >= 0 chars other than line terminators
) # end positive lookahead
| # or
(?= # begin a positive lookahead
\r?\n # match a line terminator
[^#\r\n]* # match >= 0 characters other than '#' and line terminators
) # end positive lookahead
Notice that matches are zero-width.
re.split expects the regular expression to match the delimiters in the string. It only returns the parts of the delimiters which are captured. In the case of your regex, that's only the part between the brackets, if present.
If you want the whole delimiter to show up in the list, put parentheses around the whole regex:
target = r"(#[sae](\[[\w{}=, ]*\])?)"
But you are probably better off not capturing the interior group. You can change it to a non-capturing group by using (?:…) instead of (…):
target = r"(#[sae](?:\[[\w{}=, ]*\])?)"
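As a quick check of the non-capturing version, here is a short sketch run against the sample from the question. Filtering out empty strings is an extra step added here, since re.split returns a trailing '' when the string ends with a delimiter.

```python
import re

sample = """Name: #s
Owner: #a[tag=Admin]"""

# One outer capturing group keeps the whole delimiter in the result;
# the inner (?:...) group does not add an item of its own.
target = r"(#[sae](?:\[[\w{}=, ]*\])?)"

# Drop the empty strings that re.split produces around delimiters.
parts = [p for p in re.split(target, sample) if p]
print(parts)  # ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']
```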
In your output, you keep the [tag=Admin] as that part is in a capture group, and using split can also return empty strings.
Another option is to be specific about the allowed data format, and instead of split capture the parts in 2 groups.
(\s*\w+:\s*)(#[sae](?:\[[\w{}=, ]*])?)
The pattern matches:
( Capture group 1
\s*\w+:\s* Match 1+ word characters and : between optional whitespace chars
) Close group
( Capture group 2
#[sae] Match # followed by either s a e
(?:\[[\w{}=, ]*])? Optionally match [...]
) Close group
Example code:
import re
sample = """Name: #s
Owner: #a[tag=Admin]"""
target = r"(\s*\w+:\s*)(#[sae](?:\[[\w{}=, ]*])?)"
listOfTuples = re.findall(target, sample)
lst = [s for tpl in listOfTuples for s in tpl]
print(lst)
Output
['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']

Regex pattern to find n non-space characters of x length after a certain substring

I am using the regex pattern pattern = r'cig[\s:.]*(\w{10})' to extract the 10 characters after 'cig' in each line of my dataframe. This pattern covers all cases except the ones where the 10-character substring itself contains spaces.
For example, I am trying to extract Z9F27D2198 from the string
/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031
Stack Overflow appears to have collapsed the whitespace in the string above: there should be 17 whitespace characters between F and 2, after CIG.
Could you help me edit the regex pattern so that it accounts for whitespace inside that 10-character substring? I am also using flags=re.I to ignore case in my re.findall calls.
To give an example string for which this pattern works:
CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E
and it outputs what I want: 7826328A2B.
Thanks in advance.
You can use
r'(?i)cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
Details:
cig - a cig string
[\s:.]* - zero or more whitespaces, : or .
(\S(?:\s*\S){9}) - Group 1: a non-whitespace char and then nine occurrences of zero or more whitespaces followed with a non-whitespace char
(?!\S) - immediately to the right, there must be a whitespace or end of string.
In Python, you can use
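import re
# The sample from the question, with a run of 15 spaces restored between
# "Z9F" and "27D2198" (the length implied by the span printed below)
text = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 15 + "27D2198 01762-0000031"
pattern = r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
matches = re.finditer(pattern, text, re.I)
for match in matches:
    # strip the inner whitespace from the captured code
    print(re.sub(r'\s+', '', match.group(1)), 'found at', match.span(1))
# => Z9F27D2198 found at (32, 57)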
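To apply the same pattern to many lines, a small helper could look like the sketch below. extract_cig is a hypothetical name, and the 15-space run in the first sample line is an assumption standing in for the whitespace lost in the question.

```python
import re

# Same pattern as above, compiled once with the ignore-case flag.
CIG_RE = re.compile(r"cig[\s:.]*(\S(?:\s*\S){9})(?!\S)", re.I)


def extract_cig(line):
    """Return the 10-character code after 'cig' with inner whitespace removed, or None."""
    m = CIG_RE.search(line)
    if m is None:
        return None
    return re.sub(r"\s+", "", m.group(1))


lines = [
    "/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 15 + "27D2198 01762-0000031",
    "CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E",
    "no code here",
]
print([extract_cig(line) for line in lines])
# ['Z9F27D2198', '7826328A2B', None]
```

Note that the pattern always consumes ten non-whitespace characters, so a code shorter than ten characters would pull in characters from the following word.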
What about:
# remove all spaces with replace()
x = 'CIG7826328A2B FORNITURA ENERGIA ELETTRICA U'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = '7826328A2B'
x = '/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = 'Z9F27D2198'
This works fine if there is only one "CIG" in the string. Note that replace(' ', '') removes only space characters, and that the split on "CIG" is case-sensitive.

Remove title using regular expression

How can I remove 2 or 3 characters at the beginning of a string that are followed by a dot, which may or may not be followed by a space?
i = 'mr.john'
i.replace("mr.","")
The above returns the name 'john' correctly, but not in all cases. For example:
i = 'smr. john'
i.replace("mr.","")
's john'
Expected result was 'john'
If you needed a more generic approach (i possibly having more names), you may use this code. You can define your own prefixes to remove:
import re
prefixes = ['mr', 'smr']
regex = r'\b(?:' + '|'.join(prefixes) + r')\.\s*'
i = 'hi mr.john, smr. john, etc. Previous etc should not be removed'
i = re.sub(regex,'',i)
print(i)
The created regex is this:
\b # Word boundary (to match 'mr' but not 'zmr' unless specified)
(?:group|of|prefixes|that|we|want|to|remove) # example
\. # Literal '.'
\s* # 0 or more spaces
You want two or three characters at the start of the string followed by a dot and then maybe a space. As a regular expression this looks like ^\w{2,3}\. ?.
Now you can use re.sub to replace this part with an empty string.
cleaned_name = re.sub(r'(^\w{2,3}\. ?)', r'', name)
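A short sketch of that substitution on the two inputs from the question, plus a name without a title. TITLE_RE and strip_title are illustrative names.

```python
import re

# Two or three word characters at the start, a literal dot, then an optional space.
TITLE_RE = re.compile(r"^\w{2,3}\. ?")


def strip_title(name):
    """Remove a leading 2-3 character title such as 'mr.' or 'smr. '."""
    return TITLE_RE.sub("", name)


for name in ["mr.john", "smr. john", "john"]:
    print(strip_title(name))
# john
# john
# john
```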
Use str.find with slicing.
Ex:
i = 'smr. john'
print(i[i.find(".")+1:].strip())
i2 = 'mr.john'
print(i2[i2.find(".")+1:].strip())
Output:
john
john

Do not match word boundary between parentheses with Python regex

I actually have:
regex = r'\bon the\b'
but need my regex to match only if this keyword (actually "on the") is not between parentheses in the text:
should match:
john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
should not match:
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)
I don't think a regex will help you here in the general case.
For your examples, this regex handles every line except "(my son is )on the beach", which it still matches:
(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])
description:
(?<=[^\(\)].{3}) Positive Lookbehind - Assert that the regex below
can be matched
[^\(\)] match a single character not present in the list below
\( matches the character ( literally
\) matches the character ) literally
.{3} matches any character (except newline)
Quantifier: Exactly 3 times
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
on the matches the characters on the literally (case sensitive)
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
(?=.{3}[^\(\)]) Positive Lookahead - Assert that the regex below
can be matched
.{3} matches any character (except newline)
Quantifier: Exactly 3 times
[^\(\)] match a single character not present in the list below
\( matches the character ( literally
\) matches the character ) literally
If you want to generalize the problem to text of any length between the parenthesis and the string you are searching for, this regex will not work.
The issue is the length of the text between the parenthesis and your string: in Python's re module a lookbehind must match a fixed-width pattern, so quantifiers such as * or + are not allowed inside it.
In my regex I used a positive lookahead and a positive lookbehind. The same result could be achieved with negative ones, but the issue remains.
Suggestion: write a small piece of Python code that checks whether a whole line contains your text outside parentheses, as a regex alone can't do the job.
example:
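import re
mystr = 'on the'
# mylist holds the lines to search; here, a few lines from the question
mylist = ['john is on the beach', '(my son is )on the beach', 'the spon (is on the table)']
# the unwanted patterns: mystr inside parentheses, or directly after a ')'
unwanted = re.compile(r'\(.*' + re.escape(mystr) + r'.*\)|\)' + re.escape(mystr))
# drop the lines that contain an unwanted occurrence
# (building a new list avoids removing items from a list while iterating over it)
mylist = [line for line in mylist if not unwanted.search(line)]
# look for what you want
for line in mylist:
    if mystr in line:
        print(line)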
where:
mylist: a list contains all the lines you want to search through.
mystr: the string you want to find.
Hope this helped.
On UNIX, the grep utility with the following regular expressions is sufficient. Note that in grep's basic syntax an unescaped ( is a literal parenthesis, whereas \( starts a group:
grep " on the " input_file_name | grep -v "(.* on the .*)"
How about something like this: ^(.*)(?:\(.*\))(.*)$
As you requested, it "matches only words that are not between parentheses in the text"
So, from:
some text (more text in parentheses) and some not in parentheses
Matches: some text + and some not in parentheses
EDIT: changing answer since the question was changed.
To capture all mentions not within parentheses I'd use some code instead of a huge regex.
Something like this will get you close:
import re
pattern = r"(on the)"
test_text = '''john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)'''
match_list = test_text.split('\n')
for line in match_list:
    print(line, "->", end=" ")
    bracket_pattern = r"(\(.*\))"  # matches everything between ()
    brackets = re.findall(bracket_pattern, line)
    # remove everything between () before searching
    for match in brackets:
        line = line.replace(match, "")
    matches = re.findall(pattern, line)
    print(" ".join(matches))
Output:
john is on the beach -> on the
let me put this on the fridge -> on the
he (my son) is on the beach -> on the
arnold is on the road (to home) -> on the
(my son is )on the beach -> on the (this is the only one that doesn't work)
john is at the beach ->
bob is at the pool (berkeley) ->
the spon (is on the table) ->
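The remaining failing case can be handled by checking where each occurrence sits relative to the parentheses. The sketch below is one possible approach, not the method of any answer above. It assumes parentheses are not nested in unusual ways, and it assumes that an occurrence glued directly to a closing ')' should be rejected, as the question's "(my son is )on the beach" example suggests. The name matches_outside_parens is made up.

```python
import re


def matches_outside_parens(line, phrase="on the"):
    """True if `phrase` occurs as whole words outside any (...) group."""
    # Assumption: an occurrence directly after ')' or a word character is rejected.
    pattern = re.compile(r"(?<![\w)])" + re.escape(phrase) + r"\b")
    for m in pattern.finditer(line):
        prefix = line[:m.start()]
        # Outside parentheses when every '(' before the match has been closed.
        if prefix.count("(") <= prefix.count(")"):
            return True
    return False


should_match = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
]
should_not_match = [
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]
print([matches_outside_parens(s) for s in should_match])      # [True, True, True, True]
print([matches_outside_parens(s) for s in should_not_match])  # [False, False, False, False]
```

Counting parentheses in the text before each match sidesteps the fixed-width lookbehind restriction, because the check is done in plain Python rather than inside the regex.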

python regex: use first blank as sep but maintain rest of blank sequence

I have been fighting with this regex for too long now.
The split should use a blank as the separator,
but keep the remaining blanks of a sequence attached to the next token:
'123 45  678  123.0'
=>
'123', '45', ' 678', ' 123.0'
My numbers are floats as well and the group count is unknown.
What about using a lookbehind assertion?:
>>> import re
>>> regex = re.compile(r'(?<=[^\s])\s')
>>> regex.split('this  is a  string')
['this', ' is', 'a', ' string']
regex breakdown:
(?<=...) #lookbehind. Only match if the `...` matches before hand
[^\s] #Anything that isn't whitespace
\s #single whitespace character
In English, this translates to "match a single whitespace character if it is preceded by a non-whitespace character."
Or you can use a negative lookbehind assertion:
regex = re.compile(r'(?<!\s)\s')
which might be slightly nicer (as suggested in the comments), and should be relatively easy to figure out how it works since it is very similar to the above.
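Applied to the numbers from the question, the negative-lookbehind version might be used as in the sketch below. The input string assumes double blanks before 678 and 123.0, which is what the expected output in the question implies.

```python
import re

# Split on a whitespace character that is not itself preceded by whitespace.
regex = re.compile(r"(?<!\s)\s")

tokens = regex.split("123 45  678  123.0")
print(tokens)                      # ['123', '45', ' 678', ' 123.0']

# float() ignores the leading blanks kept on each token.
print([float(t) for t in tokens])  # [123.0, 45.0, 678.0, 123.0]
```

One difference between the two variants: the negative lookbehind also matches a whitespace character at the very start of the string, which yields an empty first token, while the positive lookbehind does not.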
