Regular expression finding '\n' - python

I'm in the process of making a program to pattern match phone numbers in text.
I'm loading this text:
(01111-222222)fdf
01111222222
(01111)222222
01111 222222
01111.222222
Into a variable, and using "findall" it's returning this:
('(01111-222222)', '(01111', '-', '222222)')
('\n011112', '', '\n', '011112')
('(01111)222222', '(01111)', '', '222222')
('01111 222222', '01111', ' ', '222222')
('01111.222222', '01111', '.', '222222')
This is my expression:
ex = re.compile(r"""(
(\(?0\d{4}\)?)? # Area code
(\s*\-*\.*)? # separator
(\(?\d{6}\)?) # Local number
)""", re.VERBOSE)
I don't understand why the '\n' is being caught.
If the * in '\.*' is replaced with '+', the expression works as I want. It also works if I simply remove the * (and accept matching two sets of numbers separated by only a single period).

The \s matches both horizontal and vertical whitespace characters, so it also matches the newline. If you use re.VERBOSE, you can match a literal space with an escaped space (\ ). Or, you may exclude \r and \n from \s with [^\S\r\n] to match horizontal whitespace only.
Use
ex = re.compile(r"""(
(\(?0\d{4}\)?)? # Area code
([^\S\r\n]*-*\.*)? # separator ((HERE))
(\(?\d{6}\)?) # Local number
)""", re.VERBOSE)
Also, the - outside a character class does not require escaping.
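The difference between the two separators can be checked directly. This is a minimal sketch that assumes the sample text from the question is held in one string with '\n' line breaks; the variable names are illustrative only.

```python
import re

# Sample text from the question, joined with newlines (assumed layout).
text = "(01111-222222)fdf\n01111222222\n(01111)222222\n01111 222222\n01111.222222"

original = re.compile(r"""(
    (\(?0\d{4}\)?)?       # area code
    (\s*\-*\.*)?          # separator: \s also matches a newline
    (\(?\d{6}\)?)         # local number
)""", re.VERBOSE)

fixed = re.compile(r"""(
    (\(?0\d{4}\)?)?       # area code
    ([^\S\r\n]*-*\.*)?    # separator: horizontal whitespace only
    (\(?\d{6}\)?)         # local number
)""", re.VERBOSE)

# Collect only the outermost group (the whole phone number) for each match.
original_matches = [m.group(1) for m in original.finditer(text)]
fixed_matches = [m.group(1) for m in fixed.finditer(text)]

print(original_matches)  # the second entry starts with '\n'
print(fixed_matches)     # every entry is a complete number
```

With the original separator, the newline at the end of the first line is consumed as a separator and the area code is skipped, which is why the second match is cut short.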

Related

Replace one part of a pattern in a string/sentence?

There is a text blob for example
"Text blob1. Text blob2. Text blob3 45.6%. Text blob4."
I want to replace the dots i.e. "." with space " ". But at the same time, dots appearing between numbers should be retained. For example, the previous example should be converted to:
"Text blob1 Text blob2 Text blob3 45.6% Text blob4"
If I use:
p = re.compile('\.')
s = p.sub(' ', s)
It replaces all dots with space.
Any suggestions on what pattern or method works here?
Use
\.(?!(?<=\d\.)\d)
This expression matches every dot except one that is both preceded and followed by a digit.
EXPLANATION
NODE      EXPLANATION
--------------------------------------------------
\.        '.'
(?!       look ahead to see if there is not:
(?<=      look behind to see if there is:
\d        digits (0-9)
\.        '.'
)         end of look-behind
\d        digits (0-9)
)         end of look-ahead
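A short sketch of this pattern applied to the sample from the question. Because each matched dot is replaced by a space and most of them are already followed by one, the result contains doubled spaces; the whitespace normalization step is an addition here, not part of the original answer.

```python
import re

s = "Text blob1. Text blob2. Text blob3 45.6%. Text blob4."

# Match a dot unless it sits between two digits.
pattern = re.compile(r"\.(?!(?<=\d\.)\d)")

replaced = pattern.sub(" ", s)

# Collapse runs of whitespace and trim the ends (extra step, assumed desirable).
tidy = " ".join(replaced.split())
print(tidy)
```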
You might not need regex here. Replace a dot-space with a space.
s.replace('. ', ' ')
That isn't good enough if you have any periods followed by a newline or that terminate the string, but you still wouldn't need a regex:
s.replace('. ', ' ').replace('.\n', '\n').rstrip('.')
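As a quick check of the chained replacement, here is a sketch on a made-up two-line variant of the sample (the newline in the input is an assumption added to exercise the second replace).

```python
# Two-line variant of the question's sample text (the line break is invented).
s = "Text blob1. Text blob2.\nText blob3 45.6%. Text blob4."

cleaned = (
    s.replace(". ", " ")     # period followed by a space
     .replace(".\n", "\n")   # period at the end of a line
     .rstrip(".")            # period that terminates the string
)
print(cleaned)
```

The decimal point in 45.6 survives because it is followed by a digit rather than a space or newline.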
Suppose the string were
A.B.C blob3 45.6%. Text blob4.
Match all periods other than those both preceded and followed by a digit
If after replacements, the string
ABC blob3 45.6% Text blob4
were desired, one could use re.sub with the regular expression
r'(?<!\d)\.|\.(?!\d)'
to replace matches of periods with empty strings.
The regex reads, "match a period that is not preceded by a digit or is not followed by a digit".
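Negative lookarounds are used so that a period at the beginning or end of the string is also matched. In engines that allow lookbehind alternatives of different widths, one could instead use the logical equivalent below; Python's built-in re module rejects it, because its lookbehinds must have a fixed width: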
r'(?<=^|\D)\.|\.(?=\D|$)'
Match all periods except those both preceded and followed by a non-whitespace character
On the other hand, if, after substitutions, the string
A.B.C blob3 45.6% Text blob4
were desired one could use re.sub with the regular expression
r'(?<!\S)\.|\.(?!\S)'
to replace matches of periods with empty strings.
This regex reads, "match a period that is not preceded by a character other than a whitespace or is not followed by a character other than a whitespace".
One could instead use the logical equivalent (not accepted by Python's built-in re module, whose lookbehinds must have a fixed width):
r'(?<=^|\s)\.|\.(?=\s|$)'
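The two negative-lookaround patterns can be compared side by side in Python. The sketch below also shows that the `(?<=^|\D)` form fails to compile with the built-in re module, and offers one possible rewrite that moves `^` outside the lookbehind; that rewrite is a suggestion added here, not part of the original answer.

```python
import re

s = "A.B.C blob3 45.6%. Text blob4."

# Remove periods that are not between two digits.
not_between_digits = re.sub(r"(?<!\d)\.|\.(?!\d)", "", s)

# Remove periods that are not between two non-whitespace characters.
at_word_edge = re.sub(r"(?<!\S)\.|\.(?!\S)", "", s)

# The alternation inside the lookbehind has widths 0 and 1, which re rejects.
try:
    re.compile(r"(?<=^|\D)\.|\.(?=\D|$)")
    lookbehind_accepted = True
except re.error:
    lookbehind_accepted = False

# Possible fixed-width rewrite: keep ^ outside the lookbehind.
rewritten = re.sub(r"(?:^|(?<=\D))\.|\.(?=\D|$)", "", s)

print(not_between_digits)
print(at_word_edge)
print(lookbehind_accepted)
```

Note that the first pattern also removes the dots in A.B.C, since neither of them touches a digit.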

How to add a space after a diacritical mark only if no space came after it

I'm using this regular expression to remove an Arabic diacritical mark from a subtitle file. How could it be modified to add a space after the diacritical mark, but only if no space already follows it? I'm using Python 2.7.
file_content = re.sub(u'\u0651', '', file_content)
like
أعطني المفكّ، الآن
I need to add space after ّ
to be
أعطني المفكّ ، الآن
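With regular expressions you can search for every occurrence of the diacritical mark that is followed by a character other than a space, using a lookahead so that the following character is not consumed:
file_content = re.sub(u'\u0651(?=[^ ])', u'\u0651 ', file_content)
[^ ] means any character that is not a plain space; wrapping it in (?=...) checks for that character without replacing it.
\S could also be used instead of [^ ]; it matches any character that is not whitespace, so a tab or newline after the mark would not trigger an insertion either.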
https://docs.python.org/2/library/re.html
[] Used to indicate a set of characters.
Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
\S
Matches any character which is not a whitespace character. This is the opposite of \s. If the ASCII flag is used this becomes the equivalent of [^ \t\n\r\f\v].
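One detail worth checking is whether the character after the mark survives the substitution. The sketch below contrasts a pattern that consumes the next character with one that only looks ahead at it; the sample string is made up (Latin letters stand in for the Arabic text) and `SHADDA` is just an illustrative name for U+0651.

```python
import re

SHADDA = u"\u0651"  # the diacritical mark from the question
# Made-up sample: mark before a comma, before a space, and at the very end.
sample = u"ab" + SHADDA + u",cd" + SHADDA + u" ef" + SHADDA

# Consuming the next character replaces it along with the mark.
consuming = re.sub(SHADDA + u"[^ ]", SHADDA + u" ", sample)

# A lookahead only tests the next character, so it is kept.
lookahead = re.sub(SHADDA + u"(?=[^ ])", SHADDA + u" ", sample)

print(repr(consuming))
print(repr(lookahead))
```

In the consuming version the comma disappears; in the lookahead version it is kept and a space is inserted before it. Neither version adds a trailing space after a mark at the end of the text.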

Combining regular expressions in Python - \W and \S

I want my code to only return the special characters [".", "*", "=", ","]
I want to remove all digits/alphabetical characters ("\W") and all white spaces ("\S")
import re
original_string = "John is happy. He owns 3*4=12, apples"
new_string = re.findall("\W\S",original_string)
print(new_string)
But instead I get this as my output:
[' i', ' h', ' H', ' o', ' 3', '*4', '=1', ' a']
I have absolutely no idea why this happens. Hence I have two questions:
1) Is it possible to achieve my goal using regular expressions
2) What is actually going on with my code?
You were close, but you need to specify these escape sequences inside a character class.
re.findall(r'[^\w\s]', original_string)
# ['.', '*', '=', ',']
Note that the caret ^ indicates negation (i.e., don't match these characters).
Alternatively, instead of removing what you don't need, why not extract what you do?
re.findall(r'[.*=,]', original_string)
# ['.', '*', '=', ',']
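To see both behaviours next to each other, here is a small sketch using the string from the question: `\W\S` consumes two characters per match, while `[^\w\s]` consumes a single character that is neither a word character nor whitespace.

```python
import re

original_string = "John is happy. He owns 3*4=12, apples"

# Two-character matches: one non-word character followed by one non-space.
pairs = re.findall(r"\W\S", original_string)

# One-character matches: neither a word character nor whitespace.
singles = re.findall(r"[^\w\s]", original_string)

print(pairs)
print(singles)
```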
Here, we can also list our desired special chars in a [], match everything else around them, and then keep only those chars:
([\s\S].*?)([.*=,])?
Python Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"([\s\S].*?)([.*=,])?"
test_str = "John is happy. He owns 3*4=12, apples"
subst = "\\2"
# You can limit the number of replacements with the count argument (0 means no limit)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
JavaScript Demo
const regex = /([\s\S].*?)([.*=,])?/gm;
const str = `John is happy. He owns 3*4=12, apples`;
const subst = `$2`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
RegEx
If this wasn't our desired expression, we can modify/change it in regex101.com.
The regular expression \W\S matches a sequence of two characters; one non-word, and one non-space. If you want to combine them, that's [^\w\s] which matches one character which does not belong to either the word or the whitespace group.
However, there are many characters which are not one of the ones you enumerate which match this expression. If you want to remove characters which are not in your set, the character class containing exactly all those characters is simply [^.*=,]
Perhaps it's worth noting that inside [...] you don't need to (and in fact should not) backslash-escape e.g. the literal dot. Also note that a negated character class such as [^.*=,] does match a newline character; it is the dot outside a character class that does not match a newline by default, and the option re.DOTALL changes that.
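A brief sketch of these points, using a two-line variant of the question's string (the line break is an assumption added to show the newline behaviour):

```python
import re

text = "John is happy.\nHe owns 3*4=12, apples"

# Delete every character that is NOT one of the four wanted symbols.
# The negated class also removes the newline.
kept = re.sub(r"[^.*=,]", "", text)

# Outside a class, the dot skips newlines unless re.DOTALL is set.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", flags=re.DOTALL)

# Inside a class, an unescaped dot is a literal dot.
literal_dot = re.findall(r"[.]", "a.b")

print(kept, dot_default, dot_all, literal_dot)
```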
If you are trying to extract and parse numerical expressions, regex can be a useful part of the lexical analysis, but you really want a proper parser.

Python regex : adding space after comma only if not followed by a number

I want to add spaces before and after commas in a string, but only if the following character isn't a digit (0-9). I tried the following code:
newLine = re.sub(r'([,]+[^0-9])', r' \1 ', newLine)
But it seems like the \1 is taking the 2 matching characters and not just the comma.
Example:
>>> newLine = "abc,abc"
>>> newLine = re.sub(r'([,]+[^0-9])', r' \1 ', newLine)
"abc ,a bc"
Expected Output:
"abc , abc"
How can I tell the sub to take only the 'comma' ?
Use this one:
newLine = re.sub(r'[,]+(?![0-9])', r' , ', newLine)
Here the negative lookahead (?![0-9]) checks that the comma(s) are not followed by a digit, without consuming the next character.
Your regex didn't work because it captured both the comma and the next character (using ([,]+[^0-9])) in one group and placed a space on each side of that group.
UPDATE: If you need to handle other characters besides the comma, place them inside the character class [] and capture them in group \1 using ()
newLine = re.sub(r'([,/\\]+)(?![0-9])', r' \1 ', newLine)
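Both substitutions can be tried on a few inputs. The helper names and the extra test strings below are made up for illustration.

```python
import re

def pad_commas(line):
    # Surround a run of commas with spaces unless a digit follows.
    return re.sub(r",+(?![0-9])", " , ", line)

def pad_separators(line):
    # Same idea for commas, slashes and backslashes, keeping the matched text.
    return re.sub(r"([,/\\]+)(?![0-9])", r" \1 ", line)

print(pad_commas("abc,abc"))
print(pad_commas("1,000 units"))
print(pad_separators("a/b,c"))
```

A comma that is already followed by a space still gets padded, so an input such as "a, b" ends up with two spaces after the comma.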

regular expression to split

I'm trying to understand regular expressions in Python. How can I split the following sentence with a regular expression?
"familyname, Givenname A.15.10"
This is like the phonebook example in the Python regex documentation, http://docs.python.org/library/re.html. A person may have two or more family names and two or more given names. The family names are followed by ', ' and the given names are followed by ''. The last part is the person's office. What I have done so far is
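import re

# Read all lines of the phonebook file
with open('file.txt', 'r') as f:
    data = f.readlines()

for i in range(90):
    # Split on commas and dots, at most twice per line
    person = re.split(r'[,.]', data[i], maxsplit=2)
    print(person)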
It gives me a result like this:
['Wegner', ' Sven Ake G', '15.10\n']
I want to have something like
['Wegner', ' Sven Ake', 'G', '15', '10']
Any ideas?
In the regex world it's often easier to "match" rather than "split". When you're "matching" you tell the RE engine directly what kinds of substrings you're looking for, instead of concentrating on separating characters. The requirements in your question are a bit unclear, but let's assume that
"surname" is everything before the first comma
"name" is everything before the "office"
"office" consists of non-space characters at the end of the string
This translates to regex language like this:
rr = r"""
^ # begin
([^,]+) # match everything but a comma
(.+?) # match everything, until next match occurs
(\S+) # non-space characters
$ # end
"""
Testing:
import re
rr = re.compile(rr, re.VERBOSE)
print(rr.findall("de Batz de Castelmore d'Artagnan, Charles Ogier W.12.345"))
# [("de Batz de Castelmore d'Artagnan", ', Charles Ogier ', 'W.12.345')]
Update:
rr = r"""
^ # begin
([^,]+) # match everything but a comma
[,\s]+ # a comma and spaces
(.+?) # match everything until the next match
\s* # spaces
([A-Z]) # an uppercase letter
\. # a dot
(\d+) # some digits
\. # a dot
(\d+) # some digits
\s* # maybe some spaces or newlines
$ # end
"""
import re
rr = re.compile(rr, re.VERBOSE)
s = 'Wegner, Sven Ake G.15.10\n'
print(rr.findall(s))
# [('Wegner', 'Sven Ake', 'G', '15', '10')]
What you want to do is first split the family name by ,
familyname, rest = text.split(',', 1)
Then you want to split the office with the first space from the right.
givenname, office = rest.rsplit(' ', 1)
Assuming that family names don't contain a comma, you can extract them easily. Given names, however, are sensitive to dots. For example:
Harney, PJ A.15.10
Harvey, P.J. A.15.10
This means that you should probably extract the office from the rest of the record (once the family names are removed) with a mask anchored at the end (a regex of the form "maskpattern$").
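The split/rsplit steps above can be sketched as a small function. `split_record` is a hypothetical helper name, and the strip() calls are an addition to discard the leading space and trailing newline; it assumes the office is always the last space-separated token.

```python
def split_record(text):
    """Hypothetical helper: split 'Family, Given Office' into three parts."""
    # Everything before the first comma is the family name.
    familyname, rest = text.split(",", 1)
    # The office is the last space-separated token of the remainder.
    givenname, office = rest.strip().rsplit(" ", 1)
    return familyname.strip(), givenname.strip(), office

records = [
    "Wegner, Sven Ake G.15.10\n",
    "Harney, PJ A.15.10\n",
    "Harvey, P.J. A.15.10\n",
]
parsed = [split_record(r) for r in records]

# The office can then be broken up on its dots without touching the given names.
office_parts = parsed[2][2].split(".")

print(parsed)
print(office_parts)
```

Splitting the office separately is what keeps dotted given names such as P.J. intact.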
