Remove title using regular expression - python

How can I remove the 2 or 3 characters at the beginning of a string that are followed by a dot, which may or may not be followed by a space?
i = 'mr.john'
i.replace("mr.","")
The above returns the name 'john' correctly, but not in all cases. For example:
i = 'smr. john'
i.replace("mr.","")
's john'
Expected result was 'john'

If you need a more generic approach (i may contain several names), you can use this code and define your own prefixes to remove:
import re
prefixes = ['mr', 'smr']
regex = r'\b(?:' + '|'.join(prefixes) + r')\.\s*'
i = 'hi mr.john, smr. john, etc. Previous etc should not be removed'
i = re.sub(regex,'',i)
print(i)
The created regex is this:
\b # Word boundary (to match 'mr' but not 'zmr' unless specified)
(?:group|of|prefixes|that|we|want|to|remove) # example
\. # Literal '.'
\s* # 0 or more spaces
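As a minimal sketch, the same idea can be wrapped in a helper. The function name strip_prefixes, the use of re.escape, and the case-insensitive flag are assumptions added for illustration; they are not part of the answer above.

```python
import re

def strip_prefixes(text, prefixes):
    """Remove each listed prefix when it is followed by a dot and optional whitespace."""
    # re.escape guards against prefixes that contain regex metacharacters.
    alternatives = "|".join(re.escape(p) for p in prefixes)
    # Assumption: titles should match regardless of case ("Mr." as well as "mr.").
    pattern = re.compile(r"\b(?:" + alternatives + r")\.\s*", re.IGNORECASE)
    return pattern.sub("", text)

sample = "hi mr.john, smr. john, etc. Previous etc should not be removed"
print(strip_prefixes(sample, ["mr", "smr"]))
```

Because of the word boundary, "etc." is left alone unless "etc" is added to the prefix list.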

You want two or three characters at the start of the string followed by a dot and then maybe a space. As a regular expression this looks like ^\w{2,3}\. ?.
Now you can use re.sub to replace this part with an empty string.
cleaned_name = re.sub(r'(^\w{2,3}\. ?)', r'', name)
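A short runnable sketch of this anchored pattern on the two inputs from the question. The names TITLE and strip_title are chosen for this example only. One caveat: the pattern removes any 2 or 3 word characters before a dot at the start, so a name such as 'al.smith' would be shortened too.

```python
import re

# Two or three word characters at the start, a literal dot, then an optional space.
TITLE = re.compile(r"^\w{2,3}\. ?")

def strip_title(name):
    return TITLE.sub("", name)

for name in ["mr.john", "smr. john", "john"]:
    print(repr(strip_title(name)))
```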

Use str.find with slicing. Note that this drops everything up to and including the first dot, whatever its length.
Example:
i = 'smr. john'
print(i[i.find(".")+1:].strip())
i2 = 'mr.john'
print(i2[i2.find(".")+1:].strip())
Output:
john
john

Related

How to substitute only second occurrence of re.search() group

I need to pad parts of a string value with leading zeroes where needed:
T-46-5-В,Г,6-В,Г ---> T-46-005-В,Г,006-В,Г or
T-46-55-В,Г,56-В,Г ---> T-46-055-В,Г,056-В,Г, for example.
I have the regex pattern ^\D-\d{1,2}-([\d,]+)-[а-яА-я,]+,([\d,]+)-[а-яА-я,]+$ that retrieves the 2 separate groups of the string that I must change. The problem is that I can't substitute the changed values back into exactly those groups if the text of a re.search().group() occurs somewhere else in the string.
import re
my_string = "T-46-5-В,Г,6-В,Г"
my_pattern = r"^\D-\d{1,2}-([\d,]+)-[а-яА-я,]+,([\d,]+)-[а-яА-я,]+$"
new_string_parts = ["005", "006"]
new_string = re.sub(re.search(my_pattern, my_string).group(1), new_string_parts[0], my_string)
new_string = re.sub(re.search(my_pattern, my_string).group(2), new_string_parts[1], new_string)
print(new_string)
I get T-4006-005-В,Г,006-В,Г instead of T-46-005-В,Г,006-В,Г because there is another "6" in my_string. How can I solve this?
Thanks for your answers!
Capture the parts you need to keep and use a single re.sub pass with unambiguous backreferences in the replacement part (because they are mixed with numeric string variables):
import re
my_string = "T-46-5-В,Г,6-В,Г"
my_pattern = r"^(\D-\d{1,2}-)[\d,]+(-[а-яёА-ЯЁ,]+,)[\d,]+(-[а-яёА-ЯЁ,]+)$"
new_string_parts = ["005", "006"]
new_string = re.sub(my_pattern, fr"\g<1>{new_string_parts[0]}\g<2>{new_string_parts[1]}\3", my_string)
print(new_string)
# => T-46-005-В,Г,006-В,Г
Note I also added ёЁ to the Russian letter ranges.
The pattern - ^(\D-\d{1,2}-)[\d,]+(-[а-яёА-ЯЁ,]+,)[\d,]+(-[а-яёА-ЯЁ,]+)$ - now contains parentheses around the parts you do not need to change, and \g<1> refers to the string captured with (\D-\d{1,2}-), \g<2> refers to the value captured with (-[а-яёА-ЯЁ,]+,) and \3 - to (-[а-яёА-ЯЁ,]+).
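If the padded values should be computed rather than hard-coded, a replacement function can do the zero padding inside the same single re.sub pass. This is only a sketch: the name pad_numbers, the width of 3, and the narrowing of [\d,]+ to \d+ (so that each number is a plain run of digits) are assumptions, not part of the answer above.

```python
import re

# Five groups: prefix, first number, middle letters, second number, trailing letters.
PATTERN = re.compile(
    r"^(\D-\d{1,2}-)(\d+)(-[а-яёА-ЯЁ,]+,)(\d+)(-[а-яёА-ЯЁ,]+)$"
)

def pad_numbers(text, width=3):
    """Zero-pad the two numeric parts; strings that do not match are returned unchanged."""
    def repl(m):
        return f"{m[1]}{m[2].zfill(width)}{m[3]}{m[4].zfill(width)}{m[5]}"
    return PATTERN.sub(repl, text)

print(pad_numbers("T-46-5-В,Г,6-В,Г"))
print(pad_numbers("T-46-55-В,Г,56-В,Г"))
```

Building the result from the match object avoids any ambiguity between backreferences and the digits that follow them.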

Why can I not use re.sub to replace a group?

My goal is to find a group in a string using regex and replace it with a space.
The group I am looking to find is a group of symbols only when they fall between strings. When I use re.findall() it works exactly as expected
word = 'This##Is # A # Test#'
print(word)
re.findall(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",word)
>>> ['##', '# ', '# ', '']
But when I use re.sub(), instead of replacing the group, it replaces the entire regex.
x = re.sub(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",r' ',word)
print(x)
>>> ' #'
How can I use regular expressions to replace ONLY the group? The outcome I expect is:
'This Is A Test#'
First, there's no need to escape every "magic" character within a character class, [$#%!\s]* is equally fine and much more readable.
Second, matching (i.e. retrieving) is different from replacing and you could use backreferences to achieve your goal.
Third, if you only want to have # at the end, you could help yourself with a much easier expression:
(?:[\s#](?!\Z))+
Which would then need to be replaced by a space, see a demo on regex101.com.
In Python this could be:
import re
string = "This##Is # A # Test#"
rx = re.compile(r'(?:[\s#](?!\Z))+')
new_string = rx.sub(' ', string)
print(new_string)
# This Is A Test#
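You can group the portion of the pattern you want to retain and use a backreference in your replacement string instead. Require at least one symbol (+ rather than *) and keep the following letter out of the match with a lookahead, so that it is still available to the next match and no space is inserted where there are no symbols:
x = re.sub(r"([a-zA-Z])[$#%!\s]+(?=[a-zA-Z])", r'\1 ', word)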
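The problem is that your regex matches more than the symbols, and re.sub replaces the whole match, not just the group. Match only the symbols that sit between two word boundaries:
x = re.sub(r'\b[$#%!\s]+\b', ' ', word)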
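To replace literally only one group and keep the rest of each match, a replacement function can rebuild the match from the group's offsets. The helper name sub_group and the pattern passed to it are assumptions for this sketch; it also assumes the chosen group takes part in every match.

```python
import re

def sub_group(pattern, replacement, text, group=1):
    """Replace only the given group of each match, keeping the rest of the match."""
    def repl(m):
        # Offsets of the group relative to the start of the whole match.
        start = m.start(group) - m.start()
        end = m.end(group) - m.start()
        whole = m.group(0)
        return whole[:start] + replacement + whole[end:]
    return re.sub(pattern, repl, text)

word = "This##Is # A # Test#"
# One letter, then the captured symbols, then a lookahead for the next letter.
print(sub_group(r"[a-zA-Z]([$#%!\s]+)(?=[a-zA-Z])", " ", word))
```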

Repeated pattern in python regex

New to python regex and would like to write something that matches this
<name>.name.<age>.age#<place>
I can do this but would like the pattern to have and check name and age.
pat = re.compile("""
^(?P<name>.*)
\.
(?P<name>.*)
\.
(?P<age>.*)
\.
(?P<age>.*?)
\#
(?P<place>.*?)
$""", re.X)
I then match and extract the values.
res = pat.match('alan.name.65.age#jamaica')
Would like to know the best practice to do this?
Match .name and .age literally. You don't need new groups for that.
pat = re.compile(r"""
^(?P<name>[^.]*)\.name
\.
(?P<age>[^.]*)\.age
\#
(?P<place>.*)
$""", re.X)
Notes
I've replaced .* ("anything") by [^.]* ("anything except a dot"), because the dot cannot really be part of the name in the pattern you show.
Think whether you mean * (0-unlimited occurrences) or rather + (1-unlimited occurrences).
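A runnable sketch of this approach, using + so that empty fields are rejected. The variable name entry_pattern is chosen for this example only.

```python
import re

entry_pattern = re.compile(r"""
    ^(?P<name>[^.]+)\.name   # name, then the literal ".name"
    \.
    (?P<age>[^.]+)\.age      # age, then the literal ".age"
    \#                       # escaped: a bare # starts a comment in verbose mode
    (?P<place>.+)
    $""", re.X)

m = entry_pattern.match("alan.name.65.age#jamaica")
if m:
    # groupdict() returns all named groups as a dictionary.
    print(m.groupdict())
```

A string that lacks the literal ".name" or ".age" parts does not match at all, which is the check the question asks for.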
No reason not to allow . in names, e.g. John Q. Public.
import re
pat = re.compile(r"""(?P<name>.*?)\.name
\.(?P<age>\d+)\.age
\#(?P<place>.*$)   # '#' must be escaped: in re.X mode a bare '#' starts a comment
""",
flags=re.X)
m = pat.match('alan.name.65.age#jamaica')
print(m.group('name'))
print(m.group('age'))
print(m.group('place'))
Prints:
alan
65
jamaica
You don't need the groups if you use re.split:
re.split(r'\.name\.|\.age', "alan.name.65.age#jamaica")
This returns ['alan', '65', '#jamaica'], so the name and age are the first two elements of the list.

How to replace .. in a string in python

I am trying to replace the double dot in this string with a space:
import re
s = "haha..hehe.hoho"
s = re.sub('[..+]+',' ', s)
The output I get:
haha hehe hoho
Desired output:
haha hehe.hoho
What am I doing wrong?
Test on sites like regexpal: http://regexpal.com/
It's easier to get the output and check if the regex is right.
You should change your regex to something like: '\.\.' if you want to remove only double dots.
If you want to remove when there's at least 2 dots you can use '\.{2,}'.
A character class [] matches any single character listed inside it. Inside the brackets the dot and the plus are literal, so [..+]+ matches any run of dots or plus signs, including a single dot.
Outside a character class the dot has a special meaning in a regex. To avoid this meaning, prefix it with the escape character: \
You can read more about regular expression metacharacters here: https://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php
[a-z] A range of characters. Matches any character in the specified range.
. Matches any single character except "\n".
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
Your new code:
import re
s = "haha..hehe.hoho"
#pattern = r'\.\.' #If you want to remove when there's 2 dots
pattern = r'\.{2,}' #If you want to remove when there's at least 2 dots
s = re.sub(pattern, ' ', s)
Unless you are constrained to use regex, I find the replace() function much simpler:
s = "haha..hehe.hoho"
print(s.replace('..',' '))
gives your desired output:
haha hehe.hoho
Change:
re.sub('[..+]+',' ', s)
to:
re.sub(r'\.\.+',' ', s)
[..+]+ means "one or more of any character in the list", so it matches the single . in your input as well as the double one. Make the change as below:
s = re.sub(r'\.\.+',' ', s)
[] is a character class and matches any one of the characters in it (here, any single .).
I'm guessing you used it because a simple . wouldn't work, because it's a metacharacter meaning any character. You can simply escape it with a \ to mean a literal dot. As such:
s = re.sub(r'\.\.',' ', s)
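Here is what your regex means: [..+]+ is a character class containing a literal period and a plus sign, repeated one or more times.
So, you allow for 1 or more literal periods or plus symbols, which is not what you want.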
You do not have to repeat the same symbol when looking for it, you can use quantifiers, like {2}, which means "exactly 2 occurrences".
You can use split and join, see sample working program:
import re
s = "haha..hehe.hoho"
s = " ".join(re.split(r'\.{2}', s))
print(s)
Output:
haha hehe.hoho
Or you can use the sub with the regex, too:
s = re.sub(r'\.{2}', ' ', "haha..hehe.hoho")
In case you have cases with more than 2 periods, you should use \.{2,} regex.
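The differences between these patterns show up on an input with three dots in a row. A small comparison sketch; the helper name collapse and the second sample string are made up for illustration.

```python
import re

def collapse(pattern, text):
    """Replace every match of pattern in text with a single space."""
    return re.sub(pattern, " ", text)

samples = ["haha..hehe.hoho", "a...b"]
patterns = [
    r"[..+]+",   # character class: any run of dots or plus signs, even a single dot
    r"\.{2}",    # exactly two literal dots
    r"\.{2,}",   # two or more literal dots
]

for pattern in patterns:
    for text in samples:
        print(pattern, repr(text), "->", repr(collapse(pattern, text)))
```

With three dots, \.{2} consumes only the first two and leaves one dot behind, while \.{2,} replaces the whole run.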

regular expression to split

I am trying to understand regex in Python. How can I split the following sentence with a regular expression?
"familyname, Givenname A.15.10"
This is like the phonebook example in the Python regex documentation: http://docs.python.org/library/re.html. A person may have 2 or more family names and 2 or more given names. The family names are followed by ', ' and the given names by a space. The last item is the person's office. What I have so far is:
import re
file = open('file.txt', 'r')
data = file.readlines()
for i in range(90):
    person = re.split(r'[,\.]', data[i], maxsplit=2)
    print(person)
It gives me a result like this:
['Wegner', ' Sven Ake G', '15.10\n']
I want to have something like
['Wegner', ' Sven Ake', 'G', '15', '10']. Any idea?
In the regex world it's often easier to "match" rather than "split". When you're "matching" you tell the RE engine directly what kinds of substrings you're looking for, instead of concentrating on separating characters. The requirements in your question are a bit unclear, but let's assume that
"surname" is everything before the first comma
"name" is everything before the "office"
"office" consists of non-space characters at the end of the string
This translates to regex language like this:
rr = r"""
^ # begin
([^,]+) # match everything but a comma
(.+?) # match everything, until next match occurs
(\S+) # non-space characters
$ # end
"""
Testing:
import re
rr = re.compile(rr, re.VERBOSE)
print(rr.findall("de Batz de Castelmore d'Artagnan, Charles Ogier W.12.345"))
# [("de Batz de Castelmore d'Artagnan", ', Charles Ogier ', 'W.12.345')]
Update:
rr = r"""
^ # begin
([^,]+) # match everything but a comma
[,\s]+ # a comma and spaces
(.+?) # match everything until the next match
\s* # spaces
([A-Z]) # an uppercase letter
\. # a dot
(\d+) # some digits
\. # a dot
(\d+) # some digits
\s* # maybe some spaces or newlines
$ # end
"""
import re
rr = re.compile(rr, re.VERBOSE)
s = 'Wegner, Sven Ake G.15.10\n'
print(rr.findall(s))
# [('Wegner', 'Sven Ake', 'G', '15', '10')]
First, split off the family name at the comma:
familyname, rest = text.split(',', 1)
Then split off the office at the first space from the right:
givenname, office = rest.rsplit(' ', 1)
Assuming that family names don't contain a comma, you can extract them easily. Given names may contain dots, so you cannot split on dots alone. For example:
Harney, PJ A.15.10
Harvey, P.J. A.15.10
This means that, once the family names are removed, you should probably match the office with a pattern anchored at the end of the record (a regex of the form "maskpattern$").
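Putting the split and rsplit steps together, a sketch without regex might look like the following. The function name parse_entry is hypothetical, and the code assumes every record has exactly one office token after the last space.

```python
def parse_entry(line):
    """Split 'Family, Given Names X.12.34' into [family, given, X, 12, 34]."""
    # Everything before the first comma is the family name.
    family, rest = line.strip().split(",", 1)
    # The office is the last space-separated token; given names may contain dots.
    given, office = rest.strip().rsplit(" ", 1)
    return [family.strip(), given.strip(), *office.split(".")]

print(parse_entry("Wegner, Sven Ake G.15.10\n"))
print(parse_entry("Harvey, P.J. A.15.10"))
```

Only the office token is split on dots, so dots inside given names such as "P.J." are preserved.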
