Remove digits from the string if they are concatenated using Regex - python

I am trying to remove digits from text only when they are attached to letters or sit between the characters of a word, but not when they are part of a date.
For example:
"21st" should remain "21st",
but "alphab24et" should become "alphabet",
and if the digits stand alone, as in "26 alphabets",
the text should remain "26 alphabets".
I am using the regex below,
newString = re.sub(r'[0-9]+', '', newString)
which removes digits in any position they occur; in the example above it removes 26 as well.

You can match digits that are not enclosed with word boundaries with custom digit boundaries:
import re
newString = 'Like if "21st" then should remain "21st" But if "alphab24et" should be "alphabet" but if the digits come separately like "26 alphabets" then it should remain "26 alphabets" .'
print( re.sub(r'\B(?<!\d)[0-9]+\B(?!\d)', '', newString) )
# => Like if "21st" then should remain "21st" But if "alphabet" should be "alphabet" but if the digits come separately like "26 alphabets" then it should remain "26 alphabets" .
See the Python demo and the regex demo.
Details:
\B(?<!\d) - a non-word boundary position with no digit immediately on the left
[0-9]+ - one or more digits
\B(?!\d) - a non-word boundary position with no digit immediately on the right.
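As a quick sanity check, the pattern can be wrapped in a small helper and run against the three cases from the question. This is only a sketch: the names `EMBEDDED_DIGITS` and `strip_embedded_digits` are made up for illustration, and only the regex itself comes from the answer.

```python
import re

# Digit runs with a word character on both sides and no adjacent digit.
EMBEDDED_DIGITS = re.compile(r'\B(?<!\d)[0-9]+\B(?!\d)')

def strip_embedded_digits(text):
    """Remove digit runs that sit strictly inside a word (hypothetical helper)."""
    return EMBEDDED_DIGITS.sub('', text)

for sample in ['21st', 'alphab24et', '26 alphabets']:
    print(repr(sample), '->', repr(strip_embedded_digits(sample)))
```

Note that `_` counts as a word character for `\B`, so digits between underscores would also be removed.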

I find that a way to make my re.sub calls cleaner is to capture the things around my pattern in groups ((...) below) and put them back in the substitute pattern (\1 and \2 below).
In your case you want to catch digit sequences ([0-9]+) that are not surrounded by whitespace (\s, since you want to keep those) or by other digits ([0-9]; otherwise a digit could act as the boundary and part of a number such as 11st would be removed): [^\s0-9]. This gives:
In [1]: re.sub(r"([^\s0-9])[0-9]+([^\s0-9])", r"\1\2", "11 a000b 11 11st x11 11")
Out[1]: '11 ab 11 11st x11 11'

What you should do is add parentheses to define groups and require that the digits be surrounded by characters that are neither whitespace nor digits.
re.sub(r"([^\s\d])\d+([^\s\d])", r'\1\2', newString)
This matches only digits that sit between two such characters: the [^\s\d] parts.
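A short sketch of the capture-group approach, using the test string from the previous answer. One thing that may be worth checking for your data: the boundary class `[^\s\d]` also accepts punctuation, so a quoted `"21st"` loses its digits because the opening quote counts as a non-space character. The helper name `drop_inner_digits` is hypothetical.

```python
import re

def drop_inner_digits(text):
    """Remove digits flanked by non-space, non-digit characters (hypothetical helper)."""
    return re.sub(r'([^\s\d])\d+([^\s\d])', r'\1\2', text)

print(drop_inner_digits('11 a000b 11 11st x11 11'))
# A quote character is neither whitespace nor a digit, so it acts as a boundary too.
print(drop_inner_digits('"21st"'))
```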

Replace one part of a pattern in a string/sentence?

There is a text blob for example
"Text blob1. Text blob2. Text blob3 45.6%. Text blob4."
I want to replace the dots i.e. "." with space " ". But at the same time, dots appearing between numbers should be retained. For example, the previous example should be converted to:
"Text blob1 Text blob2 Text blob3 45.6% Text blob4"
If I use:
p = re.compile(r'\.')
s = p.sub(' ', s)
It replaces all dots with spaces, including the one in 45.6.
Any suggestions on what pattern or method works here?
Use
\.(?!(?<=\d\.)\d)
See proof. This expression matches any dot that is not directly between two digits: the lookahead rejects a dot only when it is followed by a digit and that digit is in turn preceded by a digit and the dot.
EXPLANATION
NODE        EXPLANATION
----------------------------------------------------
\.          '.'
(?!         look ahead to see if there is not:
  (?<=      look behind to see if there is:
    \d      digits (0-9)
    \.      '.'
  )         end of look-behind
  \d        digits (0-9)
)           end of look-ahead
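A minimal sketch of this pattern applied to the sample text from the question. It assumes the dots are replaced with an empty string rather than a space, since that reproduces the single-spaced output the question asks for; substituting `' '` would leave doubled spaces.

```python
import re

s = "Text blob1. Text blob2. Text blob3 45.6%. Text blob4."

# Remove every dot except one sitting directly between two digits.
cleaned = re.sub(r'\.(?!(?<=\d\.)\d)', '', s)
print(cleaned)
```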
You might not need regex here. Replace a dot-space with a space.
s.replace('. ', ' ')
That isn't good enough if you have any periods followed by a newline or that terminate the string, but you still wouldn't need a regex:
s.replace('. ', ' ').replace('.\n', '\n').rstrip('.')
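A small check of the no-regex variant, using an assumed input that contains both a period before a newline and a terminating period. The sample string is made up for illustration.

```python
# Assumed sample: the question's text with one line break added.
s = "Text blob1. Text blob2.\nText blob3 45.6%. Text blob4."

# Handle ". ", ".\n" and a trailing "." in turn; "45.6" is untouched.
result = s.replace('. ', ' ').replace('.\n', '\n').rstrip('.')
print(result)
```

Keep in mind that `rstrip('.')` strips every trailing period, so a string ending in `...` would lose all three.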
Suppose the string were
A.B.C blob3 45.6%. Text blob4.
Match all periods other than those both preceded and followed by a digit
If after replacements, the string
ABC blob3 45.6% Text blob4
were desired, one could use re.sub with the regular expression
r'(?<!\d)\.|\.(?!\d)'
to replace matches of periods with empty strings.
The regex reads, "match a period that is not preceded by a digit or is not followed by a digit".
Demo 1
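The negative lookarounds are employed so that a period at the beginning or end of the string is matched as well. One could instead use the logical equivalent with positive lookarounds:
r'(?<=^|\D)\.|\.(?=\D|$)'
Note that Python's re module rejects this form because a lookbehind must have a fixed width; there it can be written as r'(?:^|(?<=\D))\.|\.(?=\D|$)'.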
Match all periods except those both preceded and followed by a non-whitespace character
On the other hand, if, after substitutions, the string
A.B.C blob3 45.6% Text blob4
were desired one could use re.sub with the regular expression
r'(?<!\S)\.|\.(?!\S)'
to replace matches of periods with empty strings.
This regex reads, "match a period that is not preceded by a character other than a whitespace or is not followed by a character other than a whitespace".
Demo 2
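One could instead use the logical equivalent with positive lookarounds:
r'(?<=^|\s)\.|\.(?=\s|$)'
As above, Python's re module requires a fixed-width lookbehind, so there it can be written as r'(?:^|(?<=\s))\.|\.(?=\s|$)'.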
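The two negative-lookaround patterns can be checked against the sample string as follows. The positive-lookaround variants are included in a form adapted for Python, where the `^` alternative is moved outside the lookbehind because `re` only accepts fixed-width lookbehinds; that rewrite is an adaptation made for this sketch.

```python
import re

s = "A.B.C blob3 45.6%. Text blob4."

# Periods that are not both preceded and followed by a digit.
by_digit = re.sub(r'(?<!\d)\.|\.(?!\d)', '', s)
# Periods that are not both preceded and followed by a non-whitespace character.
by_space = re.sub(r'(?<!\S)\.|\.(?!\S)', '', s)
print(by_digit)
print(by_space)

# Positive-lookaround equivalents, rewritten with fixed-width lookbehinds.
by_digit_alt = re.sub(r'(?:^|(?<=\D))\.|\.(?=\D|$)', '', s)
by_space_alt = re.sub(r'(?:^|(?<=\s))\.|\.(?=\s|$)', '', s)
print(by_digit == by_digit_alt, by_space == by_space_alt)
```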

RegEx for matching two digits and everything except new lines and dot

Using Python 3, I'm trying to find a string only if it contains one or two digits (and no more than that in the same number), along with everything else following it. The match stops at a period or a newline.
\d{1,2}[^.\n]+ is almost right except it returns numbers greater than two digits.
For example:
"5+years {} experience. stop.
10 asdasdas . 255
1abc1
5555afasfasf++++s()(jn."
Should return:
5+years {} experience
10 asdasdas
1abc1
Based upon your description and your sample data, you can use following regex to match the intended strings and discard others,
^\d[^\d.]*\d?[^\d.\n]*(?=\.|$)
Regex Explanation:
^ - Start of line
\d - Matches a digit
[^\d.]* - This matches any character other than digit or dot zero or more times. This basically allows optionally matching of non-digit non-dot characters.
\d? - As you want to allow one or two digits, this is the second digit which is optional hence \d followed by ?
[^\d.\n]* - This matches any character other than digit or dot or newline
(?=\.|$) - This positive look ahead ensures, the match either ends with a dot or end of line
Also, note that multiline mode is enabled, as ^ and $ need to match the start and end of each line.
Regex Demo 1
Code:
import re
s = '''5+years {} experience. stop.
10 asdasdas . 255
1abc1
5555afasfasf++++s()(2jn.'''
print(re.findall(r'(?m)^\d[^\d.]*\d?[^\d.\n]*(?=\.|$)', s))
Prints:
['5+years {} experience', '10 asdasdas ', '1abc1']
Also, if the matching lines don't necessarily start with a digit, you can use the regex below. If you want the captured string to start with the number, take it from group 1; if the intended string doesn't have to start with a digit, use the whole match instead.
^[^\d\n]*(\d[^\d.]*\d?[^\d.\n]*)(?=\.|$)
Regex Explanation:
^ - Start of line
[^\d\n]* - Allows zero or more non-digit characters before first digit
( - Starts first grouping pattern to capture the string starting with first digit
\d - Matches a digit
[^\d.]* - This matches any character other than digit or dot zero or more times. This basically allows optionally matching of non-digit non-dot characters.
\d? - As you want to allow one or two digits, this is the second digit which is optional hence \d followed by ?
[^\d.\n]* - This matches any character other than digit or dot or newline
) - End of first capturing pattern
(?=\.|$) - This positive look ahead ensures, the match either ends with a dot or end of line
Multiline mode is enabled, which you can do either by placing (?m), an inline modifier, at the start of the regex or by passing re.MULTILINE as the third argument to re.search.
Regex Demo 2
Code:
import re
s = '''5+years {} experience. stop.
10 asdasdas . 255
1abc1
aaa1abc1
aa2aa1abc1
5555afasfasf++++s()(2jn.'''
print(re.findall(r'(?m)^[^\d\n]*(\d[^\d.]*\d?[^\d.\n]*)(?=\.|$)', s))
Prints:
['5+years {} experience', '10 asdasdas ', '1abc1', '1abc1']
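The two ways of enabling multiline mode mentioned above can be compared directly. This is a small sketch on the same sample text; the variable names `body`, `inline` and `flagged` are chosen for illustration only.

```python
import re

s = '''5+years {} experience. stop.
10 asdasdas . 255
1abc1
aaa1abc1
aa2aa1abc1
5555afasfasf++++s()(2jn.'''

body = r'^[^\d\n]*(\d[^\d.]*\d?[^\d.\n]*)(?=\.|$)'

# Inline modifier at the start of the pattern ...
inline = re.findall('(?m)' + body, s)
# ... versus the flags argument.
flagged = re.findall(body, s, re.MULTILINE)
print(inline == flagged)
```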

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric, your first pattern matches - with [-.\:alnum:] and then (.*) captures into Group 1 all 0 or more chars other than a newline. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against the qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
See the regex demo
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured ' mystr' (leading space included) in the first capture group. If the regex contains any groups, findall returns a list of the captured groups rather than the whole matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As @revo notes in comments, [....] is a character class, matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched only after it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. To account for that, use -\s*([A-Za-z0-9]*).
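A brief sketch contrasting the pattern from the question with the hand-written character class. The variable names `odd` and `fixed` are illustrative only.

```python
import re

# Inside [...] the text ":alnum:" is just a set of literal characters.
odd = r'[-.\:alnum:](.*)'
odd_first = re.findall(odd, '12 - mystr')
odd_second = re.findall(odd, 'qwertyuio')
print(odd_first, odd_second)

# A hyphen, optional whitespace, then a captured run of ASCII alphanumerics.
fixed = r'-\s*([A-Za-z0-9]*)'
fixed_first = re.findall(fixed, '12 - mystr')
fixed_second = re.findall(fixed, 'qwertyuio')
print(fixed_first, fixed_second)
```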
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
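If you want to get the alphanumeric words after the first '-' and skip the whitespace around them, you can do something like this. Python's re does not allow variable-width lookbehinds such as (?<=\s+), so the whitespace is matched with plain \s and the words are captured in group 1:
re.search(r'-\s*([a-zA-Z\d]+(?:\s+[a-zA-Z\d]+)*)', inputString)
If you want to find each run of alphanumerics directly after each '-', you can do this:
re.findall(r'(?<=-)[a-zA-Z\d]+', inputString)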

Regex: Separating All Caps from Numbers

I am using python regex to read documents.
I have the following line in many documents:
Dated: February 4, 2011 THE REAL COMPANY, INC
I can use python text search to easily find the lines that have "dated," but I want to pull THE REAL COMPANY, INC from the text without getting the "February 4, 2011" text.
I have tried the following:
[A-Z\s]{3,}.*INC
My understanding of this regex is that it should get me all capital letters and spaces before INC, but instead it pulls the full line.
This suggests to me I'm fundamentally missing something about how regex works with capital letters. Is there an easy and obvious explanation I'm missing?
What about using:
>>> import re
>>> txt
'Dated: February 4, 2011 THE REAL COMPANY, INC'
>>> re.findall('([A-Z][A-Z]+)', txt)
['THE', 'REAL', 'COMPANY', 'INC']
Another way, as suggested by @davedwards, is the following:
>>> re.findall(r'[A-Z\s]{3,}.*', txt)
[' THE REAL COMPANY, INC']
Explanation:
[A-Z\s]{3,}.*
[A-Z\s]{3,} - match a single character from the set below, repeated:
  {3,} - quantifier: between 3 and unlimited times, as many times as possible, giving back as needed (greedy)
  A-Z - a single character in the range between A (index 65) and Z (index 90) (case sensitive)
  \s - any whitespace character (equal to [\r\n\t\f\v ])
.* - match any character (except for line terminators):
  * - quantifier: between zero and unlimited times, as many times as possible, giving back as needed (greedy)
You could use
^Dated:.*?\s([A-Z ,]{3,})
And make use of the first capturing group, see a demo on regex101.com.
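A minimal sketch of reading the first capturing group in Python, using the line from the question. The variable names are illustrative.

```python
import re

line = 'Dated: February 4, 2011 THE REAL COMPANY, INC'

# The lazy .*? skips the date; the group needs 3+ capitals, spaces or commas.
m = re.search(r'^Dated:.*?\s([A-Z ,]{3,})', line)
company = m.group(1) if m else None
print(company)
```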
Your regex [A-Z\s]{3,}.*INC matches 3 or more uppercase or whitespace characters, followed by any character 0+ times, and then INC, so it will match: THE REAL COMPANY, INC
What you could also do is match Dated: from the start of the string followed by a date like format and then capture what comes after in a group. Your value will be in the first capturing group:
^Dated:\s+\S+\s+\d{1,2},\s+\d{4}\s+(.*)$
Explanation
^Dated:\s+ Match Dated: at the start of the string followed by 1+ times a whitespace character
\S+\s+ Match 1+ times not a whitespace character followed by 1+ times a whitespace character, which will match February in this case
\d{1,2}, Match 1-2 times a digit followed by a comma
\s+\d{4}\s+ Match 1+ times a whitespace character, 4 digits, followed by 1+ times a whitespace character
(.*) Capture in a group 0+ times any character
$ Assert the end of the string
Regex demo
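The date-skipping pattern can be tried in Python along these lines. This is only a sketch and assumes every line follows the "Dated: Month D, YYYY NAME" layout shown in the question.

```python
import re

line = 'Dated: February 4, 2011 THE REAL COMPANY, INC'

# Skip "Dated:", the month word, the day with its comma, and the year,
# then capture whatever remains on the line.
pattern = re.compile(r'^Dated:\s+\S+\s+\d{1,2},\s+\d{4}\s+(.*)$')
m = pattern.match(line)
company = m.group(1) if m else None
print(company)
```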

regex to extract first series of numbers in a string and all words after

Trying to write a regex that will do the following in python 2.7:
FOO 288-B BAR <MATCH: "288-B BAR">
BURT 69/ERNIE 96/KERMIT 287 <MATCH: "69">
53 ORANGE <MATCH: "53 ORANGE">
APPLE 457-W <MATCH: "457-W">
Except for "space" and '-' and '/' there is no other punctuation. I just want to match the first occurrence of any number and any letter/word following it that is preceded by a '-' or a "space".
I have tried:
([\d]+)(-?[\w+])
This misses the letters AFTER the space. Adding \s? doesn't go well for me.
(\d+(?:(?:\-\w+)|\w)?)(.*)
This picks up the letters but I can't seem to modify it to get rid of the stuff after the slash.
(\d+(?:(?:\-\w+)|\w))[^\/]*(\/*.*)
I'm trying to use [] to tackle those slashes. This was clearly unsuccessful.
If I understand your requirements, you can use this, then retrieve the matches from Group 1:
(?im)^\D*(\d+(?:[- ][a-z ]*[a-z])?)
Here's a demo (please look at the capture groups in the bottom right pane).
To retrieve the matches:
for match in re.finditer(r"(?im)^\D*(\d+(?:[- ][a-z ]*[a-z])?)", subject):
    yournumber = match.group(1)
How does it work?
The ^ in (?im) multi-line, case-insensitive mode anchors us at the beginning of the line.
The \D* skips any non-digits
The (\d+(?:[- ][a-z ]*[a-z])?) matches, and captures to Group 1, digits optionally followed by a dash or a space and more spaces and letters, ending with a letter.
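Putting the pieces together, here is a sketch that runs the pattern over the four sample lines from the question. It assumes the lines arrive as one newline-separated string called `subject`, and it is written so that it runs on Python 3 as well.

```python
import re

subject = """FOO 288-B BAR
BURT 69/ERNIE 96/KERMIT 287
53 ORANGE
APPLE 457-W"""

pattern = re.compile(r"(?im)^\D*(\d+(?:[- ][a-z ]*[a-z])?)")

# One match per line; Group 1 holds the number plus any trailing words.
matches = [m.group(1) for m in pattern.finditer(subject)]
print(matches)
```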
