Regular expression to match only two numbers - python

I have a string like this:
something: 20 kg/ something: 120 kg
I have this regex, "[0-9]{1,2} kg", but it returns 20 kg both times (the second time from inside 120 kg). I need it to match 20 kg only in the first case.

Try this:
(?<!\d)\d{1,2}\s+kg
The (?<!...) is a negative lookbehind, so the pattern matches one or two digits that are not preceded by another digit. I also replaced the literal space with \s+, which matches one or more whitespace characters.
Seeing you've asked Python questions, here's a demo in Python:
#!/usr/bin/env python
import re
string = 'something: 20 kg/ something: 120 kg'
print(re.findall(r'(?<!\d)\d{1,2}\s+kg', string))
which will print ['20 kg']
Edit:
As @Tim mentioned, a word boundary \b is enough for this input: r'\b\d{1,2}\s+kg'
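The two patterns are not fully interchangeable, though. As a small sketch (the second sample string is made up for illustration and is not from the question), the lookbehind only rejects a preceding digit, while \b also rejects a preceding letter or underscore:

```python
import re

# The second sample is a hypothetical input, added only to show the difference.
samples = ['something: 20 kg/ something: 120 kg', 'item20 kg']

lookbehind = re.compile(r'(?<!\d)\d{1,2}\s+kg')
boundary = re.compile(r'\b\d{1,2}\s+kg')

for text in samples:
    # Print the input followed by what each pattern finds in it.
    print(repr(text), lookbehind.findall(text), boundary.findall(text))
```

Which one fits depends on whether a number glued to a preceding word should count as a match.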

Related

Parsing based on pattern not at the beginning

I want to extract the number before "2022" in a set of strings. I currently do
a = mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails when mystring=' 20220220519AX': it returns an empty string instead of a='202'. Therefore, I want the code to split the string on a "2022" that is not at the start of the string (ignoring leading whitespace).
Can you please guide me with this?
Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
    print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.
You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2022 it can find, and since there's nothing stopping it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022 itself, but in the digits before it: the \d+ will match as long a string of digits as it can find (greedy), but the (?=2022) part, a 'positive lookahead', says it must be followed by a literal 2022 to be a match (and the 2022 won't be part of the match).
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
will show you all non-overlapping matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.
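To make the greedy behaviour concrete, here is a small sketch that runs the same lookahead pattern once with the greedy \d+ and once with the lazy \d+? on a shortened version of that string. The lazy variant is an addition for comparison only and is not part of the answer above.

```python
import re

text = '920220220519AX'

# Greedy: take as many digits as possible while still being followed by 2022.
greedy = re.findall(r'\d+(?=2022)', text)
# Lazy: take as few digits as possible while still being followed by 2022.
lazy = re.findall(r'\d+?(?=2022)', text)

print(greedy)  # ['9202']
print(lazy)    # ['9', '202']
```

Either way, each pass only reports non-overlapping matches from left to right, not every possible combination.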
You could use strip() and then split on a 2022 that is not at the start of the string, asserted with a negative lookbehind. Alternatively, you can capture the first run of 1 or more digits from the start of the string that is followed by 2022, which also works when there are more occurrences of 2022:
import re
strings = [
    ' 1020220519AX',
    ' 20220220519AX'
]
for s in strings:
    parts = re.split(r"(?<!^)2022", s.strip())
    if parts:
        print(parts[0])
for s in strings:
    m = re.match(r"\s*(\d+?)2022", s)
    if m:
        print(m.group(1))
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits; the string is only split.
If the string consists of only word characters, splitting on \B2022, where \B means "not at a word boundary", will also prevent splitting at the start of the example string.
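A minimal sketch of the \B variant, assuming the stripped string contains only word characters as stated above (the helper name before_2022 is made up for this example):

```python
import re

def before_2022(s):
    # \B2022 only matches a 2022 that has a word character directly before it,
    # so a 2022 at the very start of the stripped string is never a split point.
    return re.split(r'\B2022', s.strip(), maxsplit=1)[0]

for s in [' 1020220519AX', ' 20220220519AX']:
    print(before_2022(s))
```

maxsplit=1 is used here so that only the first qualifying 2022 is considered.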

Regex thousand separator: either , or .

I'm using a Python regex and I'm receiving numbers that use either a , or a . as the thousand separator. If the , is the thousand separator then the . is the decimal separator, and vice versa. The only positive thing is that there are always exactly two decimal digits.
I need a regex for these numbers, and I don't care about the decimal part, so I would like to extract the following. Can someone smarter than me help? This is giving me a headache.
111.112.123,55 -> 111112123
123.44 -> 123
123,353,123.55 -> 123353123
21,23 -> 21
152.00 -> 152
You may use the following pattern:
[,.]\d+$|[.,]
[,.] Character set for either , or ..
\d+$ Digits at end of string.
| Alternation (OR).
[.,] Character set for either , or ..
Python demo:
import re
mynumbers=['111.112.123,55','123.44','123,353,123.55','21,23','152.00']
for number in mynumbers:
    print(re.sub(r'[,.]\d+$|[.,]','',number))
Prints:
111112123
123
123353123
21
152
You may alternatively use a more restrictive pattern if you are working with text:
[.,]\d+$|(?<=\d{3})[.,]
Python:
import re
mytext = '''
111.112.123,55
123.44
123,353,123.55
21,23
152.00
Text, and punctuation.
'''
for line in mytext.splitlines():
    print(re.sub(r'[.,]\d+$|(?<=\d{3})[.,]','',line))
Prints:
111112123
123
123353123
21
152
Text, and punctuation.
Assuming you are dealing with strings that only contain one number, you can use this pattern:
re.sub(r'[.,](?:\d\d$)?', '', s)
(a , or a . optionally followed by 2 digits and the end of the string.)
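As a sketch of how that substitution could be wrapped up, assuming each input string holds exactly one number with two decimal digits (the function name integer_part is made up for this example):

```python
import re

def integer_part(number: str) -> int:
    """Drop the separators and the two-digit decimal part, then convert to int."""
    return int(re.sub(r'[.,](?:\d\d$)?', '', number))

for sample in ['111.112.123,55', '123.44', '123,353,123.55', '21,23', '152.00']:
    print(sample, '->', integer_part(sample))
```

Converting to int is optional; it is done here only to show that the cleaned string is a plain integer.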
You could capture one or more digits in a capturing group (\d+), followed by a character class [.,] that matches either a dot or a comma.
To match the digits at the end, you could use an optional non-capturing group (?:\d+$)? that matches one or more digits followed by the end of the string. You might start the match with a word boundary to prevent it from being part of a longer match.
In the replacement, use the first capturing group \1:
\b(\d+)[.,](?:\d+$)?
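A possible Python version of that replacement, as a sketch for strings that each contain a single number (the name strip_separators is made up for this example):

```python
import re

pattern = re.compile(r'\b(\d+)[.,](?:\d+$)?')

def strip_separators(number):
    # \1 keeps only the digits captured before each separator;
    # the separator and any trailing decimal digits are dropped.
    return pattern.sub(r'\1', number)

for sample in ['111.112.123,55', '123.44', '123,353,123.55', '21,23', '152.00']:
    print(sample, '->', strip_separators(sample))
```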

Python regular expression. Find a sentence in a sentence

I'm trying to find an expression "K others" in a sentence "Chris and 34K others"
I tried with regular expression, but it doesn't work :(
import re
value = "Chris and 34K others"
m = re.search("(.K.others.)", value)
if m:
    print("it is true")
else:
    print("it is not")
Guessing that you're web-page scraping "you and 34k others liked this on Facebook", and you're wrapping "K others" in a capture group, I'll jump straight to how to get the number:
import re
value = "Chris and 34K others blah blah"
# The regex describes:
# a leading whitespace character, one or more non-whitespace characters
# (so punctuation such as commas is caught), an optional space,
# and a trailing 'K others' in any capitalisation
m = re.search(r"\s(\S+?)\s*K others", value, re.IGNORECASE)
if m:
    captured_values = m.groups()
    print("Number of others:", captured_values[0], "K")
else:
    print("it is not")
This should also cover uppercase/lowercase K, numbers with commas (1,100K people), spaces between the number and the K, and work if there's text after 'others' or if there isn't.
You should use search rather than match unless you expect your regular expression to match at the beginning. The help string for re.match mentions that the pattern is applied at the start of the string.
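A short sketch of the difference, using the string from the question:

```python
import re

value = "Chris and 34K others"

# re.match only tries the pattern at the start of the string.
print(re.match(r"K others", value))    # None
# re.search scans the whole string for the first place the pattern fits.
print(re.search(r"K others", value))   # a match object
# re.match does succeed when the pattern fits at position 0.
print(re.match(r"Chris", value))       # a match object
```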
If you want to match something within the string, use re.search; re.match only matches at the beginning. Also, change your regex to (K.others): the trailing . breaks the regex because there is no character after "others" for it to match, and the leading . just matches whatever character comes before the K. I removed both:
>>> bool(re.search("(K.others)", "Chris and 34K others"))
True
The RegEx (K.others) matches:
Chris and 34K others
            ^^^^^^^^
As opposed to (.K.others.), which matches nothing here. You can use (.K.others) as well, which also matches the character before the K:
Chris and 34K others
           ^^^^^^^^^
Also, you can use \s instead of . so that only a whitespace character is accepted between the words: (K\sothers). This matches a literal K, one whitespace character, and others.
Now, if you want to capture everything before and everything after as well, try: (.+)?(K\sothers)(\s.+)?. The first group then holds the text before the K, which includes the number.
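If the goal is the count itself, a sketch along these lines may help. The helper name others_count and the accepted number formats (digits with optional commas and an optional decimal part) are assumptions made for this example, not something the question specifies:

```python
import re

def others_count(text):
    """Hypothetical helper: return the number before 'K others' as an int, or None."""
    m = re.search(r'(\d[\d,]*(?:\.\d+)?)\s*K\s+others', text, re.IGNORECASE)
    if m is None:
        return None
    # Drop thousands commas, then scale by 1000 for the K suffix.
    return round(float(m.group(1).replace(',', '')) * 1000)

print(others_count("Chris and 34K others"))
print(others_count("You and 1,100k others liked this"))
print(others_count("no count in this sentence"))
```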

using reg exp to check if test string is of a fixed format

I want to make sure using regex that a string is of the format- "999.999-A9-Won" and without any white spaces or tabs or newline characters.
There may be 2 or 3 numbers in the range 0 - 9.
Followed by a period '.'
Again followed by 2 or 3 numbers in the range 0 - 9
Followed by a hyphen, character 'A' and a number between 0 - 9 .
This can be followed by anything.
Example: 87.98-A8-abcdef
The code I have come up with so far is:
import re
testString = "87.98-A1-help"
regCompiled = re.compile('^[0-9][0-9][.][0-9][0-9][-A][0-9][-]*')
checkMatch = re.match(regCompiled, testString)
if checkMatch:
    print("FOUND")
else:
    print("Not Found")
This doesn't seem to work, and I'm not sure what I'm missing. Also, I'm not checking for whitespace, tabs, or newline characters, and I have hard-coded the number of digits before and after the decimal point.
With {m,n} you can specify the number of times a pattern can repeat, and the \d character class matches all digits. The \S character class matches anything that is not whitespace. Using these your regular expression can be simplified to:
re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
Note also the \Z anchor, which makes the \S* expression match all the way to the end of the string. No whitespace (newlines, tabs, etc.) is allowed here. If you combine this with the .match() method, you ensure that every character in the tested string conforms to the pattern, nothing more, nothing less. See search() vs. match() for more information on .match().
A small demonstration:
>>> import re
>>> pattern = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
>>> pattern.match('87.98-A1-help')
<_sre.SRE_Match object at 0x1026905e0>
>>> pattern.match('123.45-A6-no whitespace allowed')
>>> pattern.match('123.45-A6-everything_else_is_allowed')
<_sre.SRE_Match object at 0x1026905e0>
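On Python 3.4 and later, fullmatch() is another way to get the same whole-string check without writing the \Z anchor. This is an alternative sketch, not the method used in the answer above, and the helper name is_valid is made up:

```python
import re

# Same pattern as above minus the \Z anchor: fullmatch() requires the
# pattern to cover the entire string by itself.
pattern = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*')

def is_valid(candidate):
    return pattern.fullmatch(candidate) is not None

for candidate in ['87.98-A1-help', '123.45-A6-no whitespace allowed', '87.98-A1-help\n']:
    print(repr(candidate), is_valid(candidate))
```

The last sample shows that even a trailing newline is rejected.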
Let's look at your regular expression. If you want:
"2 or 3 numbers in the range 0 - 9"
then you can't start your regular expression with ^[0-9][0-9][.] because that only matches strings with exactly two digits before the period. A second issue with your regex is at the end: [0-9][-]* matches a digit followed by zero or more hyphens; if you wish to match anything at the end of the string, finish your regular expression with .* instead. Also, [-A] is a character class that matches a single character, either - or A, not the sequence -A. Edit: see Martijn Pieters's answer regarding the whitespace in the regular expressions.
Here is an updated regular expression:
import re
testString = "87.98-A1-help"
regCompiled = re.compile(r'^[0-9]{2,3}\.[0-9]{2,3}-A[0-9]-.*')
checkMatch = regCompiled.match(testString)
if checkMatch:
    print("FOUND")
else:
    print("Not Found")
Not everything needs to be enclosed inside [ and ], in particular when you know the character(s) that you wish to match (such as the part -A). Furthermore:
the notation {m,n} means: match at least m times and at most n times, and
to explicitly match a dot, you need to escape it: that's why there is \. in the regular expression above.
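To see why the escape matters, here is a small sketch comparing the pattern with and without the backslash. The second sample string is made up for illustration:

```python
import re

unescaped = re.compile(r'^[0-9]{2,3}.[0-9]{2,3}-A[0-9]-.*')   # . matches any character
escaped = re.compile(r'^[0-9]{2,3}\.[0-9]{2,3}-A[0-9]-.*')    # \. matches only a literal dot

for candidate in ['87.98-A1-help', '87x98-A1-help']:
    print(candidate, bool(unescaped.match(candidate)), bool(escaped.match(candidate)))
```

Only the escaped version rejects the string where the dot has been replaced by another character.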

Python Regular Expression Match All 5 Digit Numbers but None Larger

I'm attempting to match 5-digit coupon codes spread throughout an HTML web page, for example 53232, 21032, 40021, etc. I can handle the simpler case of any string of 5 digits with [0-9]{5}, though this also matches inside 6, 7, 8... n digit numbers. Can someone please suggest how I would modify this regular expression to match only 5-digit numbers?
>>> import re
>>> s="four digits 1234 five digits 56789 six digits 012345"
>>> re.findall(r"\D(\d{5})\D", s)
['56789']
If they can occur at the very beginning or the very end, it's easier to pad the string than to mess with special cases:
>>> re.findall(r"\D(\d{5})\D", " "+s+" ")
Instead of padding the string to cover the start and end of the string, as in John La Rooy's answer, one can use a negative lookbehind and a negative lookahead to handle both cases with a single regular expression:
>>> import re
>>> s = "88888 999999 3333 aaa 12345 hfsjkq 98765"
>>> re.findall(r"(?<!\d)\d{5}(?!\d)", s)
['88888', '12345', '98765']
full string: ^[0-9]{5}$
within a string: [^0-9][0-9]{5}[^0-9]
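Note: There is a problem with using \D: it matches, and therefore consumes, a non-digit character on each side of the number, so a separator shared by two adjacent numbers can only be used once. Use \b instead.
\b matters here because it matches the empty string at the beginning or end of a word, without consuming any character.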
import re
s = "four digits 1234 five digits 56789 six digits 01234,56789,01234"
re.findall(r"\b\d{5}\b", s)
result : ['56789', '01234', '56789', '01234']
but if one uses
re.findall(r"\D(\d{5})\D", s)
output : ['56789', '01234']
\D cannot handle numbers separated by a single comma (or any other single character), and it also misses a number at the very start or end of the string.
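One caveat with \b is that letters and underscores count as word characters, so a code glued to a word has no boundary next to it. A small sketch comparing \b with the digit-only lookarounds from the earlier answer (the sample string is made up for illustration):

```python
import re

s = "abc12345 12345x 123456 54321"

# \b requires a non-word character (or the string edge) on each side.
with_boundary = re.findall(r"\b\d{5}\b", s)
# The lookarounds only require that the neighbours are not digits.
with_lookaround = re.findall(r"(?<!\d)\d{5}(?!\d)", s)

print(with_boundary)    # ['54321']
print(with_lookaround)  # ['12345', '12345', '54321']
```

Which behaviour is right depends on whether codes attached to letters should count.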
More documentation: https://docs.python.org/2/library/re.html
A very simple way would be to match all groups of digits, like with r'\d+', and then skip every match that isn't five characters long when you process the results.
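That filtering approach could look roughly like this, using the sample string from the first answer:

```python
import re

s = "four digits 1234 five digits 56789 six digits 012345"

# Collect every run of digits, then keep only the runs that are exactly five long.
five_digit = [run for run in re.findall(r'\d+', s) if len(run) == 5]
print(five_digit)  # ['56789']
```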
You probably want to match a non-digit before and after your string of 5 digits, like [^0-9]([0-9]{5})[^0-9]. Then you can capture the inner group (the actual string you want).
You could try
\D\d{5}\D
or maybe
\b\d{5}\b
I'm not sure how Python treats line endings and whitespace there, though.
I believe ^\d{5}$ would not work for you, as you likely want to get numbers that are somewhere within other text.
I use a simpler expression:
re.findall(r"\d{5}", mystring)
It searches for 5 consecutive digits. Note that it also matches the first five digits of any longer number, so you have to be sure the string contains no longer runs of digits.
