Extracting numbers from a string under certain conditions

Extracting numbers from a string under certain conditions - python

I have some strings that are stored in a dataframe using pandas and I want to extract all numbers out of them if it exists. The conditions these numbers must meet are quite specific and I'm not really sure if I can use regex to solve my problem. The conditions are:
The number CANNOT be at the start of the string
It CANNOT appear after the word "No. " or after the word "Question "
Also if possible, if the number has an e right after it I would want to keep that as well. However this is less important.
This is what I have so far to find all the numbers, but I do not know how to code the conditions I mentioned above.
testNumbers = re.findall(r'\d+', row['Name'])
For a given string: " Test T860 Article No. 9712250 787"
I would want the regex expression to return
[860, 787]

You may use
(?!^)(?<!\d)(?<!\bNo\.\s)(?<!\bQuestion\s)(\d+)(?!\d)
In Python, declare as a raw string literal:
pattern = r'(?!^)(?<!\d)(?<!\bNo\.\s)(?<!\bQuestion\s)(\d+)(?!\d)'
See the regex demo
Details
(?!^) - not at the start of the string
(?<!\d) - no digit immediately before the current location is allowed
(?<!\bNo\.\s) - no No. and a whitespace immediately before is allowed
(?<!\bQuestion\s) - no Question and a whitespace immediately before is allowed
(\d+) - Group 1: one or more digits
(?!\d) - no digit immediately after the current location is allowed.
In Pandas, you may use it like
df = pd.DataFrame({'text':[" Test T860 Article No. 9712250 787"," Test F199 Article Question 9712250787"]})
df['numbers'] = df['text'].str.findall(r'(?!^)(?<!\d)(?<!\bNo\.\s)(?<!\bQuestion\s)(\d+)(?!\d)').apply(','.join)
Output:
>>> df
text numbers
0 Test T860 Article No. 9712250 787 860,787
1 Test F199 Article Question 9712250787 199

Here we can use an expression with word boundaries and quantifier:
\b[A-Z]+(\d+)\b|\b([0-9]{1,3})\b
Demo
RegEx
If this expression wasn't desired or you wish to modify it, please visit regex101.com.
RegEx Circuit
jex.im visualizes regular expressions:

Related

regex: how to get repeating blocks as groups()? [duplicate]

I need to capture multiple groups of the same pattern. Suppose, I have the following string:
HELLO,THERE,WORLD
And I've written the following pattern
^(?:([A-Z]+),?)+$
What I want it to do is to capture every single word, so that Group 1 is : "HELLO", Group 2 is "THERE" and Group 3 is "WORLD". What my regex is actually capturing is only the last one, which is "WORLD".
I'm testing my regular expression here and I want to use it with Swift (maybe there's a way in Swift to get intermediate results somehow, so that I can use them?)
UPDATE: I don't want to use split. I just need to now how to capture all the groups that match the pattern, not only the last one.

With one group in the pattern, you can only get one exact result in that group. If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
You have to use your language's regex implementation functions to find all matches of a pattern, then you would have to remove the anchors and the quantifier of the non-capturing group (and you could omit the non-capturing group itself as well).
Alternatively, expand your regex and let the pattern contain one capturing group per group you want to get in the result:
^([A-Z]+),([A-Z]+),([A-Z]+)$

The key distinction is repeating a captured group instead of capturing a repeated group.
As you have already found out, the difference is that repeating a captured group captures only the last iteration. Capturing a repeated group captures all iterations.
In PCRE (PHP):
((?:\w+)+),?
Match 1, Group 1. 0-5 HELLO
Match 2, Group 1. 6-11 THERE
Match 3, Group 1. 12-20 BRUTALLY
Match 4, Group 1. 21-26 CRUEL
Match 5, Group 1. 27-32 WORLD
Since all captures are in Group 1, you only need $1 for substitution.
I used the following general form of this regular expression:
((?:{{RE}})+)
Example at regex101

I think you need something like this....
b="HELLO,THERE,WORLD"
re.findall('[\w]+',b)
Which in Python3 will return
['HELLO', 'THERE', 'WORLD']

After reading Byte Commander's answer, I want to introduce a tiny possible improvement:
You can generate a regexp that will match either n words, as long as your n is predetermined. For instance, if I want to match between 1 and 3 words, the regexp:
^([A-Z]+)(?:,([A-Z]+))?(?:,([A-Z]+))?$
will match the next sentences, with one, two or three capturing groups.
HELLO,LITTLE,WORLD
HELLO,WORLD
HELLO
You can see a fully detailed explanation about this regular expression on Regex101.
As I said, it is pretty easy to generate this regexp for any groups you want using your favorite language. Since I'm not much of a swift guy, here's a ruby example:
def make_regexp(group_regexp, count: 3, delimiter: ",")
regexp_str = "^(#{group_regexp})"
(count - 1).times.each do
regexp_str += "(?:#{delimiter}(#{group_regexp}))?"
end
regexp_str += "$"
return regexp_str
end
puts make_regexp("[A-Z]+")
That being said, I'd suggest not using regular expression in that case, there are many other great tools from a simple split to some tokenization patterns depending on your needs. IMHO, a regular expression is not one of them. For instance in ruby I'd use something like str.split(",") or str.scan(/[A-Z]+/)

Just to provide additional example of paragraph 2 in the answer. I'm not sure how critical it is for you to get three groups in one match rather than three matches using one group. E.g., in groovy:
def subject = "HELLO,THERE,WORLD"
def pat = "([A-Z]+)"
def m = (subject =~ pat)
m.eachWithIndex{ g,i ->
println "Match #$i: ${g[1]}"
}
Match #0: HELLO
Match #1: THERE
Match #2: WORLD

The problem with the attempted code, as discussed, is that there is one capture group matching repeatedly so in the end only the last match can be kept.
Instead, instruct the regex to match (and capture) all pattern instances in the string, what can be done in any regex implementation (language). So come up with the regex pattern for this.
The defining property of the shown sample data is that the patterns of interest are separated by commas so we can match anything-but-a-comma, using a negated character class
[^,]+
and match (capture) globally, to get all matches in the string.
If your pattern need be more restrictive then adjust the exclusion list. For example, to capture words separated by any of the listed punctuation
[^,.!-]+
This extracts all words from hi,there-again!, without the punctuation. (The - itself should be given first or last in a character class, unless it's used in a range like a-z or 0-9.)
In Python
import re
string = "HELLO,THERE,WORLD"
pattern = r"([^,]+)"
matches = re.findall(pattern,string)
print(matches)
In Perl (and many other compatible systems)
use warnings;
use strict;
use feature 'say';
my $string = 'HELLO,THERE,WORLD';
my #matches = $string =~ /([^,]+)/g;
say "#matches";
(In this specific example the capturing () in fact aren't needed since we collect everything that is matched. But they don't hurt and in general they are needed.)
The approach above works as it stands for other patterns as well, including the one attempted in the question (as long as you remove the anchors which make it too specific). The most common one is to capture all words (usually meaning [a-zA-Z0-9_]), with the pattern \w+. Or, as in the question, get only the substrings of upper-case ascii letters[A-Z]+.

I know that my answer came late but it happens to me today and I solved it with the following approach:
^(([A-Z]+),)+([A-Z]+)$
So the first group (([A-Z]+),)+ will match all the repeated patterns except the final one ([A-Z]+) that will match the final one. and this will be dynamic no matter how many repeated groups in the string.

You actually have one capture group that will match multiple times. Not multiple capture groups.
javascript (js) solution:
let string = "HI,THERE,TOM";
let myRegexp = /([A-Z]+),?/g; // modify as you like
let match = myRegexp.exec(string); // js function, output described below
while (match != null) { // loops through matches
console.log(match[1]); // do whatever you want with each match
match = myRegexp.exec(string); // find next match
}
Syntax:
// matched text: match[0]
// match start: match.index
// capturing group n: match[n]
As you can see, this will work for any number of matches.

Sorry, not Swift, just a proof of concept in the closest language at hand.
// JavaScript POC. Output:
// Matches: ["GOODBYE","CRUEL","WORLD","IM","LEAVING","U","TODAY"]
let str = `GOODBYE,CRUEL,WORLD,IM,LEAVING,U,TODAY`
let matches = [];
function recurse(str, matches) {
let regex = /^((,?([A-Z]+))+)$/gm
let m
while ((m = regex.exec(str)) !== null) {
matches.unshift(m[3])
return str.replace(m[2], '')
}
return "bzzt!"
}
while ((str = recurse(str, matches)) != "bzzt!") ;
console.log("Matches: ", JSON.stringify(matches))
Note: If you were really going to use this, you would use the position of the match as given by the regex match function, not a string replace.

Design a regex that matches each particular element of the list rather then a list as a whole. Apply it with /g
Iterate throught the matches, cleaning them from any garbage such as list separators that got mixed in. You may require another regex, or you can get by with simple replace substring method.
The sample code is in JS, sorry :) The idea must be clear enough.
const string = 'HELLO,THERE,WORLD';
// First use following regex matches each of the list items separately:
const captureListElement = /^[^,]+|,\w+/g;
const matches = string.match(captureListElement);
// Some of the matches may include the separator, so we have to clean them:
const cleanMatches = matches.map(match => match.replace(',',''));
console.log(cleanMatches);

repeat the A-Z pattern in the group for the regular expression.
data="HELLO,THERE,WORLD"
pattern=r"([a-zA-Z]+)"
matches=re.findall(pattern,data)
print(matches)
output
['HELLO', 'THERE', 'WORLD']

only keep digits after ":" in regular expression in python [duplicate]

This question already has answers here:
How to grab number after word in python
(4 answers)
Closed 2 years ago.
I want to extract the numbers for each parameter below:
import re
parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
The desired output should be:['42602','42401','42101']
I first tried re.findall(r'\d+',parameters), but it also returns the "2" from "NO2" and "SO2".
Then I tried re.findall(':.*',parameters), but it returns [': 42602', ': 42401', ': 42101']
If I can not rename the "NO2" to "Nitrogen dioxide", is there a way just to collect numbers on the right (after ":")?
Many thanks.

If you do not want to use capturing groups, you could use look behind.
(?<=:\s)\d+
Details:
(?<=:\s): gets string after :\s
\d+: gets digits
I also tried result on python.
import re
parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
result = re.findall(r'(?<=:\s)\d+',parameters)
print (result)
Result
['42602', '42401', '42101']

You can use the following regex to capture the numbers
^\s*\w+:\s(\d+)$
Hereby, ^ in the beginning asserts the position at the start of the line. \s* means that there may be 0 or more whitespaces before the content. \w+:\s matches a word character followed by ":" and space, that is "NO2: ".
Finally, (\d+) matches the following digits you want as a group. $ matches the end of the line.
To get all the matches as a list you can use
matches = re.findall(r'^\s*\w+:\s(\d+)$', parameters, re.MULTILINE)
As re.MULTILINE is specified,
the pattern character '^' matches at the beginning of the string and
at the beginning of each line.
as stated in the docs.
The result is as follows
>> print(matches)
['42602', '42401', '42101']

To put my two cents in, you could simpley use
re.findall(r'(\b\d+\b)', parameters)
See a demo on regex101.com.
If you happen to have other digits floating around somewhere in your string, be more precise with
\w+:\s*(\d+)
See another demo on regex101.com.

re.findall(r'(?<=:\s)\d+', parameters)
Should work. You can learn more about look-behind from here.

You just need to specify where in your string do you want to search for digits, you can use:
re.findall(r': (\d+)', parameters)
This tells Python to look for digits in the part of the string after ":" and the "space".

Find first pattern and retrieve data up to next pattern find

I have a string as follows:
my_str = "808c000003a185c50cd9b00285e78220500ac56a1c5ca5a1004b2404aa412f058c0a1ba85820cc8208080813c7040a228e0133ca5aca03a2829012533208704411004010808c001003a1c5c50cd9b00285e7822"
I want to group together the strings if they meet a condition
Whenever there is a sequence of '0808', I want this and the following text UP TO the point of the next 0808 pattern etc..
result = re.findall(r'(0808)', my_str)
This just gives me a list of the pattern itself.
I want it to include the pattern and the following text. Something fundamentally wrong with the regex i'm inputting.
Help appreciated.

0808.*?(?=0808)
This looks for the 0808 string then lazily matches 0 or more chars afterwards up to the point of the lookahead to 0808
This will return a match for each selection of 0808foobar matches

Regex sub phone number format multiple times on same string

I'm trying to use reg expressions to modify the format of phone numbers in a list.
Here is a sample list:
["(123)456-7890 (321)-654-0987",
"(111) 111-1111",
"222-222-2222",
"(333)333.3333",
"(444).444.4444",
"blah blah blah (555) 555.5555",
"666.666.6666 random text"]
Every valid number has either a space OR start of string character leading, AND either a space OR end of string character trailing. This means that there can be random text in the strings, or multiple numbers on one line. My question is: How can I modify the format of ALL the phone numbers with my match pattern below?
I've written the following pattern to match all valid formats:
p = re.compile(r"""
(((?<=\ )|(?<=^)) #space or start of string
((\([0-9]{3}\))|[0-9]{3}) #Area code
(((-|\ )?[0-9]{3}-[0-9]{4}) #based on '-'
| #or
((\.|\ )?[0-9]{3}\.[0-9]{4})) #based on '.'
(?=(\ |$))) #space or end of string
""", re.X)
I want to modify the numbers so they adhere to the format:
\(\d{3}\)d{3}-\d{4} #Ex: (123)456-7890
I tried using re.findall, and re.sub but had no luck. I'm confused on how to deal with the circumstance of there being multiple matches on a line.
EDIT: Desired output:
["(123)456-7890 (321)654-0987",
"(111)111-1111",
"(222)222-2222",
"(333)333-3333",
"(444)444-4444",
"blah blah blah (555)555-5555",
"(666)666-6666 random text"]

Here's a more simple solution that works for all of those cases, though is a little naïve (and doesn't care about matching brackets).
\(?(\d{3})\)?[ -.]?(\d{3})[ -.]?(\d{4})
Replace with:
(\1)\2-\3
Try it online
Explanation:
Works by first checking for 3 digits, and optionally surrounding brackets on either side, with \(?(\d{3})\)?. Notice that the 3 digits are in a capturing group.
Next, it checks for an optional separator character, and then another 3 digits, also stored in a capturing group: [ -.]?(\d{3}).
And lastly, it does the previous step again - but with 4 digits instead of 3: [ -.]?(\d{4})
Python:
To use it in Python, you should just be able to iterate over each element in the list and do:
p.sub('(\\1)\\2-\\3', myString) # Note the double backslashes, or...
p.sub(r'(\1)\2-\3', myString) # Raw strings work too
Example Python code
EDIT
This solution is a bit more complex, and ensures that if there is a close bracket, there must be a start bracket.
(\()?((?(1)\d{3}(?=\))|\d{3}(?!\))))\)?[ -.]?(\d{3})[ -.]?(\d{4})
Replace with:
(\2)\3-\4
Try it online

How to specify the regex string in python

I have the following 2 strings of train station IDs (showing the direction of travel) separated by "-".
String A (strA):
NS1-NS2-NS3-NS4-NS5-NS7-NS8-NS9-NS10-NS11-NS13-NS14-NS15-NS16-NS17-NS18-NS19-NS20-NS21-NS22-NS23-NS24-NS25-NS26-NS27
String B (strB):
NS27-NS26-NS25-NS24-NS23-NS22-NS21-NS20-NS19-NS18-NS17-NS16-NS15-NS14-NS13-NS11-NS10-NS9-NS8-NS7-NS5-NS4-NS3-NS2-NS1
I want to find out which of String A or B contains stations "NS4" followed by "NS1" (answer should be String B).
My current code as follows:
searchStr = ".*NS4-.*NS1(-.*|)"
re.search(searchStr, strA)
re.search(searchStr, strB)
But the result keep returning a match in String A.
May I know how to specify 'searchStr' in order to match only String B?

Two ways to do it: tokenizing and improving the regex.
Tokenizing
tokA = strA.split('-')
tokB = strB.split('-')
print('NS4' in tokA and tokA.index('NS1') > tokA.index('NS4'))
print('NS4' in tokB and tokB.index('NS1') > tokB.index('NS4'))
# False
# True
Regex
import re
pattern = '(^|-)NS4.+NS1(-|$)'
print(re.search(pattern, strA) is not None)
print(re.search(pattern, strB) is not None)
# False
# True
Performance
Tokenization: 2.3072939129997394
Regex: 11.138173280000046
But if you really need performance, I'm sure there are faster ways. Even the tokenization method does multiple passes.

As an alternative to tokenizing, you could use the following expression.
NS4(?=.*?NS1(?!\d))
It literally means:
The characters "NS4" literally.
Followed by any characters, until it finds NS1.
NS1 cannot be followed by a digit.
To educate readers as to what I've used:
(?=) is a Positive Lookahead.
Whatever you place inside this token must be found for the match to be True.
I placed .*? to match anything, as few times as possible using the ? quantifier, followed by NS1 since that is what we want to find.
(?!) is a Negative Lookahead
Whatever you place inside this token, as you might guess, must NOT be found for the match to be True.
I placed a digit in here, so that things like NS10 or NS11 or NS19 are never matched.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting numbers from a string under certain conditions - python

Here we can use an expression with word boundaries and quantifier: \b[A-Z]+(\d+)\b|\b([0-9]{1,3})\b Demo RegEx If this expression wasn't desired or you wish to modify it, please visit regex101.com. RegEx Circuit jex.im visualizes regular expressions:

Related

regex: how to get repeating blocks as groups()? [duplicate]

only keep digits after ":" in regular expression in python [duplicate]

Find first pattern and retrieve data up to next pattern find

Regex sub phone number format multiple times on same string

How to specify the regex string in python

Categories

Resources