regex issue with numbers (python)

regex issue with numbers (python) - python

I have a string thath goes like that:
<123, 321>
the range of the numbers can be between 0 to 999.
I need to insert those coordinates as fast as possible into two variables, so i thought about regex. I've already splitted the string to two parts and now i need to isolate the integer from all the other characters.
I've tried this pattern:
^-?[0-9]+$
but the output is:
[]
any help? :)

If your strings follow the same format <123, 321> then this should be a little bit faster than regex approach
def str_translate(s):
return s.translate(None, " <>").split(',')
In [52]: str_translate("<123, 321>")
Out[52]: ['123', '321']

All you need to do is to get rid of the anchors( ^ and $)
>>> import re
>>> string = "<123, 321>"
>>> re.findall(r"-?[0-9]+", string)
['123', '321']
>>>
Note ^ $ Anchors at the start and end of patterns -?[0-9]+ ensures that the string consists only of digits.
That is the regex engine attempts to match the pattern from the start of the string, ^ using -?[0-9]+ till the end of the string $. But it fails because < cannot be matched by -?[0-9]+
Where as the re.findall will find all substrings that match the pattern -?[0-9]+, that is the digits.

"^-?[0-9]+$" will only match a string that contains a number and nothing else.
You want to match a group and the extract that group:
>>> pattern = re.compile("(-?[0-9]+)")
>>> pattern.findall("<123, 321>")
['123', '321']

Related

Regular expression matching all but a string

I need to find all the strings matching a pattern with the exception of two given strings.
For example, find all groups of letters with the exception of aa and bb. Starting from this string:
-a-bc-aa-def-bb-ghij-
Should return:
('a', 'bc', 'def', 'ghij')
I tried with this regular expression that captures 4 strings. I thought I was getting close, but (1) it doesn't work in Python and (2) I can't figure out how to exclude a few strings from the search. (Yes, I could remove them later, but my real regular expression does everything in one shot and I would like to include this last step in it.)
I said it doesn't work in Python because I tried this, expecting the exact same result, but instead I get only the first group:
>>> import re
>>> re.search('-(\w.*?)(?=-)', '-a-bc-def-ghij-').groups()
('a',)
I tried with negative look ahead, but I couldn't find a working solution for this case.

You can make use of negative look aheads.
For example,
>>> re.findall(r'-(?!aa|bb)([^-]+)', string)
['a', 'bc', 'def', 'ghij']
- Matches -
(?!aa|bb) Negative lookahead, checks if - is not followed by aa or bb
([^-]+) Matches ony or more character other than -
Edit
The above regex will not match those which start with aa or bb, for example like -aabc-. To take care of that we can add - to the lookaheads like,
>>> re.findall(r'-(?!aa-|bb-)([^-]+)', string)

You need to use a negative lookahead to restrict a more generic pattern, and a re.findall to find all matches.
Use
res = re.findall(r'-(?!(?:aa|bb)-)(\w+)(?=-)', s)
or - if your values in between hyphens can be any but a hyphen, use a negated character class [^-]:
res = re.findall(r'-(?!(?:aa|bb)-)([^-]+)(?=-)', s)
Here is the regex demo.
Details:
- - a hyphen
(?!(?:aa|bb)-) - if there is aaa- or bb- after the first hyphen, no match should be returned
(\w+) - Group 1 (this value will be returned by the re.findall call) capturing 1 or more word chars OR [^-]+ - 1 or more characters other than -
(?=-) - there must be a - after the word chars. The lookahead is required here to ensure overlapping matches (as this hyphen will be a starting point for the next match).
Python demo:
import re
p = re.compile(r'-(?!(?:aa|bb)-)([^-]+)(?=-)')
s = "-a-bc-aa-def-bb-ghij-"
print(p.findall(s)) # => ['a', 'bc', 'def', 'ghij']

Although a regex solution was asked for, I would argue that this problem can be solved easier with simpler python functions, namely string splitting and filtering:
input_list = "-a-bc-aa-def-bb-ghij-"
exclude = set(["aa", "bb"])
result = [s for s in input_list.split('-')[1:-1] if s not in exclude]
This solution has the additional advantage that result could also be turned into a generator and the result list does not need to be constructed explicitly.

Add [] around numbers in strings

I like to add [] around any sequence of numbers in a string e.g
"pixel1blue pin10off output2high foo9182bar"
should convert to
"pixel[1]blue pin[10]off output[2]high foo[9182]bar"
I feel there must be a simple way but its eluding me :(

Yes, there is a simple way, using re.sub():
result = re.sub(r'(\d+)', r'[\1]', inputstring)
Here \d matches a digit, \d+ matches 1 or more digits. The (...) around that pattern groups the match so we can refer to it in the second argument, the replacement pattern. That pattern simply replaces the matched digits with [...] around the group.
Note that I used r'..' raw string literals; if you don't you'd have to double all the \ backslashes; see the Backslash Plague section of the Python Regex HOWTO.
Demo:
>>> import re
>>> inputstring = "pixel1blue pin10off output2high foo9182bar"
>>> re.sub(r'(\d+)', r'[\1]', inputstring)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'

You can use re.sub :
>>> s="pixel1blue pin10off output2high foo9182bar"
>>> import re
>>> re.sub(r'(\d+)',r'[\1]',s)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar
Here the (\d+) will match any combinations of digits and re.sub function will replace it with the first group match within brackets r'[\1]'.
You can start here to learn regular expression http://www.regular-expressions.info/

Python - why doesn't this simple regex work?

This code below should be self explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile('\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print result
None
Alternatively, this works:
>>> char_regex = re.compile('\w')
>>> result = char_regex.match(string)
>>> print result
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?

Here is what re.match() says If zero or more characters at the beginning of string match the regular expression pattern ...
In your case the string doesn't have any digit \d at the beginning. But for the \w it has t at the beginning at your string.
If you want to check for digit in your string using same mechanism, then add .* with your regex:
digit_regex = re.compile('.*\d')

The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.

Split a string using an integer as a delimeter

I have a rather long txt file filled with strings of the format {letter}{number}{letter}. For instance, the first few lines of my file are:
A123E
G234W
R3L
H4562T
I am having difficulty finding the correct regex pattern to separate each line by alpha and numeric.
For instance, in the first line, I would like an array with the results:
print first_line[0] // A
print first_line[1] // 123
ptin first_line[2] // E
It seems like regex would be the way to go, but I'm still a regex novice. Could someone help point me in the correct direction on how to do this?
Then I plan to iterate over each of the lines and use the info as necessary.

Split on \d+:
import re
re.split(r'(\d+)', line)
\d is the character class matching the digits 0 through to 9, and we want to match at least 1 of them. By putting a capturing group around the \d+, re.split() will include the match in the output:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
Demo:
>>> import re
>>> re.split(r'(\d+)', 'A123E')
['A', '123', 'E']

python regular expression numbers in a row

I'm trying to check a string for a maximum of 3 numbers in a row for which I used:
regex = re.compile("\d{0,3}")
but this does not work for instance the string 1234 would be accepted by this regex even though the digit string if over length 3.

If you want to check a string for a maximum of 3 digits in string you need to use '\d{4,}' as you are only interest in the digits string over a length of 3.
import re
str='123abc1234def12'
print re.findall('\d{4,}',str)
>>> '[1234]'
If you use {0,3}:
str='123456'
print re.findall('\d{0,3}',str)
>>> ['123', '456', '']
The regex matches digit strings of maximum length 3 and empty strings but this cannot be used to test correctness. Here you can't check whether all digit strings are in length but you can easily check for digits string over the length.
So to test do something like this:
str='1234'
if re.match('\d{4,}',str):
print 'Max digit string too long!'
>>> Max digit string too long!

\d{0} matches every possible string. It's not clear what you mean by "doesn't work", but if you expect to match a string with digits, increase the repetition operator to {1,3}.
If you wish to exclude runs of 4 or more, try something like (?:^|\D)\d{1,3}(?:\D|$) and of course, if you want to capture the match, use capturing parentheses around \d{1,3}.

The method you have used is to find substrings with 0-3 numbers, it couldn't reach your expactation.
My solve:
>>> import re
>>> re.findall('\d','ds1hg2jh4jh5')
['1', '2', '4', '5']
>>> res = re.findall('\d','ds1hg2jh4jh5')
>>> len(res)
4
>>> res = re.findall('\d','23425')
>>> len(res)
5
so,next you just need use ‘if’ to judge the numbers of digits.

There could be a couple reasons:
Since you want \d to search for digits or numbers, you should probably spell that as "\\d" or r"\d". "\d" might happen to work, but only because d isn't special (yet) in a string. "\n" or "\f" or "\r" will do something totally different. Check out the re module documentation and search for "raw strings".
"\\d{0,3}" will match just about anything, because {0,3} means "zero or up to three". So, it will match the start of any string, since any string starts with the empty string.
or, perhaps you want to be searching for strings that are only zero to three numbers, and nothing else. In this case, you want to use something like r"^\d{0,3}$". The reason is that regular expressions match anywhere in a string (or only at the beginning if you are using re.match and not re.search). ^ matches the start of the string, and $ matches the end, so by putting those at each end you are not matching anything that has anything before or after \d{0,3}.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

regex issue with numbers (python) - python

If your strings follow the same format <123, 321> then this should be a little bit faster than regex approach def str_translate(s): return s.translate(None, " <>").split(',') In [52]: str_translate("<123, 321>") Out[52]: ['123', '321']

"^-?[0-9]+$" will only match a string that contains a number and nothing else. You want to match a group and the extract that group: >>> pattern = re.compile("(-?[0-9]+)") >>> pattern.findall("<123, 321>") ['123', '321']

Related

Regular expression matching all but a string

Add [] around numbers in strings

Python - why doesn't this simple regex work?

Split a string using an integer as a delimeter

python regular expression numbers in a row

Categories

Resources