Removing leading digits from string using Python? - python

I see many questions asking how to remove leading zeroes from a string, but I have not seen any that ask how to remove any and all leading digits from a string.
I have been experimenting with combinations of lstrip, type function, isdigit, slice notation and regex without yet finding a method.
Is there a simple way to do this?
For example:
"123ABCDEF" should become "ABCDEF"
"ABCDEF" should stay as "ABCDEF"
"1A1" should become "A1"
"12AB3456CD" should become "AB3456CD"

A simple way is to use string.digits, which provides a string containing all the ASCII digit characters, '0123456789', and pass it to str.lstrip as the set of characters to remove.
>>> from string import digits
>>> s = '123dog12'
>>> s.lstrip(digits)
'dog12'
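As a quick check, the same call can be run against the four examples from the question. This is only a small sketch; the helper name strip_leading_digits is made up here for illustration.

```python
from string import digits

def strip_leading_digits(s):
    # Hypothetical helper: remove every leading ASCII digit (0-9) from s.
    return s.lstrip(digits)

examples = ["123ABCDEF", "ABCDEF", "1A1", "12AB3456CD"]
results = [strip_leading_digits(s) for s in examples]
print(results)  # ['ABCDEF', 'ABCDEF', 'A1', 'AB3456CD']
```

Digits that appear after the first non-digit character are left alone, because lstrip stops at the first character that is not in the given set.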

I want to point out that although Mitch's and RebelWithoutAPulse's answers are both correct, they do not do the same thing.
Mitch's answer left-strips any characters in the set '1', '2', '3', '4', '5', '6', '7', '8', '9', '0'.
>>> from string import digits
>>> digits
'0123456789'
>>> '123dog12'.lstrip(digits)
'dog12'
RebelWithoutAPulse's answer, on the other hand, left-strips any character known to be a digit.
>>> import re
>>> re.sub(r'^\d+', '', '123dog12')
'dog12'
So what's the difference? Well, there are two differences:
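There are many digit characters other than the Indo-Arabic numerals 0-9. For str patterns, \d matches all of them, whereas string.digits contains only the ASCII ones.
The name lstrip is ambiguous for RTL languages. It actually removes matching characters from the start of the string, which may be displayed on the right side. The regex ^ anchor is more straightforward about it.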
Here are a few examples:
>>> '١٩٨٤فوبار٤٢'.lstrip(digits)
'١٩٨٤فوبار٤٢'
>>> re.sub(r'^\d+', '', '١٩٨٤فوبار٤٢')
'فوبار٤٢'
>>> '𝟏𝟗𝟖𝟒foobar𝟒𝟐'.lstrip(digits)
'𝟏𝟗𝟖𝟒foobar𝟒𝟐'
>>> re.sub(r'^\d+', '', '𝟏𝟗𝟖𝟒foobar𝟒𝟐')
'foobar𝟒𝟐'
(note for the Arabic example, Arabic being read from right to left, it is correct for the number on the right to be removed)
So… I guess the conclusion is be sure to pick the right solution depending on what you're trying to do.
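If the regex approach is preferred but only the ASCII digits 0-9 should be stripped, one option is the re.ASCII flag, which restricts \d to [0-9]. The sketch below contrasts the two behaviours. The function name and the ascii_only parameter are invented for this example, and the test string is built from escapes for the Arabic-Indic digits 1984 followed by plain ASCII text.

```python
import re

def strip_leading_digits(s, ascii_only=False):
    # Hypothetical helper. With ascii_only=True, \d matches only 0-9;
    # otherwise it matches any Unicode decimal digit.
    flags = re.ASCII if ascii_only else 0
    return re.sub(r'^\d+', '', s, flags=flags)

# U+0661 U+0669 U+0668 U+0664 are the Arabic-Indic digits 1, 9, 8, 4.
mixed = '\u0661\u0669\u0668\u0664' + 'foobar42'

print(strip_leading_digits('123dog12'))                       # dog12
print(strip_leading_digits(mixed))                            # foobar42
print(strip_leading_digits(mixed, ascii_only=True) == mixed)  # True
```

With ascii_only=True the result matches what lstrip(digits) does on the same input.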

Using regexes from re:
import re
re.sub(r'^\d+', '', '1234AB456')
Becomes:
'AB456'
This replaces a run of one or more digits at the beginning of the string with the empty string.

Related

How to split a string in python by certain characters?

I am trying to solve a problem with prefix notation, but I am stuck on the part, where I want to split my string into an array:
If I have the input +22 2 I want to get the array to look like this:['+', '22', '2']
I tried importing the re module with
import re
but I am not sure how it works.
I also tried the
word.split(' ')
method, but it only handles the spaces. Any ideas?
P.S:
In the prefix notation I will also have + - and *.
So I need to split the string so the space is not in the array, and +, -, * is in the array
I am thinking of
word = input()
array = word.split(' ')
Then after that I am thinking of splitting a string by these 3 characters.
Sample input:
'+-12 23*67 1'
Output:
['+', '-', '12', '23', '*', '67', '1']
You can use re to find patterns in text. It seems you are looking for either one of the operators +, - and *, or a group of digits. So compile a pattern that looks for those, find everything that matches it, and you will get a list:
import re
pattern = re.compile(r'([-+*]|\d+)')
string = '+-12 23*67 1'
array = pattern.findall(string)
print(array)
# Output:
# ['+', '-', '12', '23', '*', '67', '1']
Also a bit of testing (comparing your sample strings with the expected output):
test_cases = {
    '+-12 23*67 1': ['+', '-', '12', '23', '*', '67', '1'],
    '+22 2': ['+', '22', '2']
}
for string, correct in test_cases.items():
    assert pattern.findall(string) == correct
print('Tests completed successfully!')
Pattern explanation (you can read about this in the docs linked below):
r'([-+*]|\d+)'
r in front makes it a raw string, so Python does not process backslash escapes in the literal and passes them unchanged to the regex engine; this lets you write regex escape sequences such as \d with a single backslash
(...) parentheses around (they are not necessary in this case) indicate a group which can later be retrieved if needed (but in this case they don't matter much)
[...] indicates that any single character from this group can be matched so it will match if any of -, + and * will be present
| logical or, meaning that can match either side (to differentiate between numbers and special characters in this case)
\d special escape sequence for digits, meaning to match any digit, the + there indicates matching any one or more digits
Useful:
re module, the docs there explain what each character in the pattern does
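As a small illustration of the remark about the parentheses, the sketch below compares the pattern with and without the capturing group. With a single group, findall returns what the group captured, which here is the whole match, so both results are the same. The last step, turning digit tokens into integers, is an optional extra that is not part of the answer above, and the variable names are placeholders.

```python
import re

text = '+-12 23*67 1'

with_group = re.findall(r'([-+*]|\d+)', text)
without_group = re.findall(r'[-+*]|\d+', text)

print(with_group)                   # ['+', '-', '12', '23', '*', '67', '1']
print(without_group == with_group)  # True

# Optional: convert the digit tokens to int and keep the operators as strings.
tokens = [int(t) if t.isdigit() else t for t in without_group]
print(tokens)                       # ['+', '-', 12, 23, '*', 67, 1]
```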

Regex matching for value between two patterns doesn't return all instances when the values are consecutive

I am trying to find all instances of a number within an equation. For that, I wrote this Python script:
re.findall(fr"([\-\+\*\/\(]|^)({val})([\-\+\*\/\)]|$)", equation)
Now, when I give it this: 20+5-20, and search for 20, the output is as expected: [('', '20', '+'), ('-', '20', '')]
But, when I simply do 20+20-5, it doesn't work anymore and I only get the first instance: [('', '20', '+')]
I don't understand why. It's not even a problem of 20 being at the start and end; for example, 5-20*4-20/3 still matches 20 just fine. It just doesn't work when the value is repeated consecutively.
How do I fix this?
Thank you
The reason your pattern does not work for 20+20-5 is that, after matching the first occurrence of 20, the trailing character class actually consumes the +.
After consuming it, for the second occurrence of 20 right after it, this part of the pattern ([\-\+\*\/\(]|^) cannot match, as there is no character left to match with the character class, and it is not at the start of the string.
Using 20 for example at the place of {val} you can use lookarounds, which do not consume the value but only assert that it is present.
Note that you don't have to escape the values in the character class, and for the last assertion you don't have to add another non capture group.
(?:(?<=[-+*/(])|^)20(?=[-+*/)]|$)
import re
strings = [
    "20+5-20",
    "20+20-5"
]
val = 20
pattern = fr"(?:(?<=[-+*/(])|^){val}(?=[-+*/)]|$)"
for equation in strings:
    print(re.findall(pattern, equation))
Output
['20', '20']
['20', '20']
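One thing to watch when the searched value is not a plain integer: a value such as 2.5 contains a dot, which is regex syntax. A possible sketch, assuming the value may contain such characters, passes it through re.escape first and uses finditer to report positions. The helper name find_value is made up for this example.

```python
import re

def find_value(equation, val):
    # Hypothetical helper: return (start, end) spans of val where it stands
    # between operators, parentheses, or the ends of the string.
    # re.escape keeps characters such as "." in val from acting as regex syntax.
    pattern = fr"(?:(?<=[-+*/(])|^){re.escape(str(val))}(?=[-+*/)]|$)"
    return [m.span() for m in re.finditer(pattern, equation)]

print(find_value("20+20-5", 20))         # [(0, 2), (3, 5)]
print(find_value("2.5+205-2.5", "2.5"))  # [(0, 3), (8, 11)]
```

Without the escaping, the second call would also report 205, because an unescaped dot matches any character.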
I suggest just searching for all numbers (integer + decimal) in your expression, and then filtering for certain values:
import re
inp = "20+5-20*3.20"
matches = re.findall(r'\d+(?:\.\d+)?', inp)
matches = [x for x in matches if x == '20']
print(matches) # ['20', '20']
Every number in your formula should only be surrounded by either arithmetic symbols, parentheses, or whitespace, all of which are non word characters.
I think I found an answer, though I'm still not sure how correct it is, or why it works and mine doesn't:
re.findall(fr"(?:(?<=[\=\-\+\*\/\(])|^)({val})(?:(?=[\=\-\+\*\/\)])|$)", equation)
Basically, it uses a lookbehind and a lookahead to check that the value sits between operators.

Regex expression for a given string

I am running into a small issue. I need a regular expression that splits a given string into separate pieces: each number, each chunk of characters inside square brackets, and each plain run of characters.
For example, if I have a string that resembles
s = 2[abc]3[cd]ef
I need a list lst = ['2','abc','3','cd','ef']
Here is the code I have so far:
import re
s = "2[abc]3[cd]ef"
s_final = ""
res = re.findall("(\d+)\[([^[\]]*)\]", s)
print(res)
This is outputting a list of tuples that looks like this.
[('2', 'abc'), ('3', 'cd')]
I am very new to regular expression and learning.. Sorry if this is an easy one.
Thanks!
The immediate fix is getting rid of the capturing groups and using alternation to match either digits or chars other than square bracket chars:
import re
s = "2[abc]3[cd]ef"
res = re.findall(r"\d+|[^][]+", s)
print(res)
# => ['2', 'abc', '3', 'cd', 'ef']
See the regex demo and the Python demo. Details:
\d+ - one or more digits
| - or
[^][]+ - one or more chars other than [ and ]
Other solutions that might help are:
re.findall(r'\w+', s)
re.findall(r'\d+|[^\W\d_]+', s)
where \w+ matches one or more letters, digits, underscores and some more connector punctuation with diacritics and [^\W\d_]+ matches any one or more Unicode letters.
See this Python demo.
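The three patterns give the same result on the sample string, but they differ once the bracketed text contains underscores or punctuation. A small sketch with a made-up input (not from the question) to show the difference:

```python
import re

s = "2[a_b]3[c-d]ef"  # made-up input, not from the question

not_brackets = re.findall(r"\d+|[^][]+", s)
word_chars = re.findall(r"\w+", s)
letters_only = re.findall(r"\d+|[^\W\d_]+", s)

print(not_brackets)  # ['2', 'a_b', '3', 'c-d', 'ef']
print(word_chars)    # ['2', 'a_b', '3', 'c', 'd', 'ef']
print(letters_only)  # ['2', 'a', 'b', '3', 'c', 'd', 'ef']
```

Which one is right depends on whether the bracket contents should be kept whole or broken at non-letter characters.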
Don't look for one regex that matches the whole string; use a regex that matches each block instead. \w fits well here (it means [a-zA-Z0-9_] for ASCII text, and for Python 3 str patterns it also matches Unicode letters and digits).
import re
s = "2[abc]3[cd]ef"
print(re.findall(r"\w+", s)) # ['2', 'abc', '3', 'cd', 'ef']
Or split on the brackets:
print(re.split(r"[\[\]]", s)) # ['2', 'abc', '3', 'cd', 'ef']
Regex stands for regular expression, and your string is irregular.
Regex is mostly used to find a specific pattern in a long text, to validate text, or to extract things from text.
For example, to find a phone number in a string I would use a regex, but to build a calculator that needs to extract operators and digits I would not; I would rather write plain Python code for that.

Regular expression to extract numbers around a slash "/"

I tried to extract numbers in a format like "**/*,**/*".
For example, for string "272/3,65/5", I want to get a list = [272,3,65,5]
I tried to use \d*/ and /\d* to extract 272,65 and 3,5 separately, but I want to know if there's a method to get a list I showed above directly. Thanks!
You can use [0-9] to represent a single decimal digit, so a better way of writing the regex would be something along the lines of [0-9]*/?[0-9]+,[0-9]+/?[0-9]*. I'm assuming you are trying to match inputs such as 5,2/3, 23242/234,23, or 0,0.
Breakdown of the main components in [0-9]*/?[0-9]+,[0-9]+/?[0-9]*:
[0-9]* matches 0 or more digits
/? optionally matches a '/'
[0-9]+ matches at least one digit
My favorite tool for debugging regexes is https://regex101.com/ since it explains what each of the operators means and shows you how the regex is performing on a sample of your choice as you write it.
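Note that this pattern describes the shape of the whole input rather than picking out the individual numbers, so it is more useful for validation than for extraction. A rough sketch of that use, with re.fullmatch chosen here as one way to require that the entire string fits (the sample inputs other than the first are taken from the answer's own examples, plus one made-up failing case):

```python
import re

# The pattern from the answer above, compiled once.
PAIR = re.compile(r'[0-9]*/?[0-9]+,[0-9]+/?[0-9]*')

candidates = ["272/3,65/5", "5,2/3", "0,0", "272/3"]
# fullmatch succeeds only if the entire string fits the pattern.
verdicts = {c: PAIR.fullmatch(c) is not None for c in candidates}
print(verdicts)
# {'272/3,65/5': True, '5,2/3': True, '0,0': True, '272/3': False}
```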
# import the regular expression module
import re
# input string
s = "272/3,65/5"
# findall returns every non-overlapping match of the pattern
result = re.findall(r'\d+', s)
print(result) # ['272', '3', '65', '5']
Here \d matches a single digit.
If we use \d on its own, every digit is returned separately:
result = re.findall(r'\d', s)
print(result) # ['2', '7', '2', '3', '6', '5', '5']
So we should use \d+, where + greedily matches one or more consecutive digits and stops at the first non-digit.
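The question asks for a list of numbers rather than strings, so the matches still need converting. A short sketch of that step, plus a variant that keeps the two numbers around each slash together (the pairing is an extra that the question did not ask for):

```python
import re

s = "272/3,65/5"

# Flat list of integers, as asked for in the question.
numbers = [int(n) for n in re.findall(r'\d+', s)]
print(numbers)  # [272, 3, 65, 5]

# Two capturing groups make findall return one tuple per "a/b" pair.
pairs = [(int(a), int(b)) for a, b in re.findall(r'(\d+)/(\d+)', s)]
print(pairs)    # [(272, 3), (65, 5)]
```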

python regular expression numbers in a row

I'm trying to check a string for a maximum of 3 numbers in a row for which I used:
regex = re.compile("\d{0,3}")
but this does not work: for instance, the string 1234 is accepted by this regex even though the digit run is longer than 3.
If you want to check that a string contains at most 3 digits in a row, search for '\d{4,}' instead, since you are only interested in digit runs longer than 3.
import re
s = '123abc1234def12'
print(re.findall(r'\d{4,}', s))
# ['1234']
If you use {0,3}:
s = '123456'
print(re.findall(r'\d{0,3}', s))
# ['123', '456', '']
The regex matches digit strings of at most length 3, as well as empty strings, so it cannot be used to test correctness. You can't directly check that every digit run is within the limit, but you can easily check for a digit run over the limit.
So to test, do something like this (re.search is used so that a long run is found anywhere in the string, not only at the start):
s = '1234'
if re.search(r'\d{4,}', s):
    print('Max digit string too long!')
# Max digit string too long!
\d{0,3} allows zero repetitions, so it can match the empty string, which means it matches every possible string. It's not clear what you mean by "doesn't work", but if you expect to match a string with digits, increase the repetition operator to {1,3}.
If you wish to exclude runs of 4 or more, try something like (?:^|\D)\d{1,3}(?:\D|$) and of course, if you want to capture the match, use capturing parentheses around \d{1,3}.
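One caveat with that pattern: the \D on each side consumes the neighbouring character, so two short runs separated by a single character cannot both be found. A possible variant, not from the answer above, swaps the consuming groups for negative lookarounds, which only check the neighbours. A sketch comparing the two:

```python
import re

# Pattern from the answer above, with a capturing group around the digits.
consuming = re.compile(r'(?:^|\D)(\d{1,3})(?:\D|$)')
# Variant with lookarounds: the neighbours are checked but not consumed.
lookaround = re.compile(r'(?<!\d)\d{1,3}(?!\d)')

print(consuming.findall('12a34'))             # ['12']
print(lookaround.findall('12a34'))            # ['12', '34']
print(lookaround.findall('123abc1234def12'))  # ['123', '12']
```

In the last call the run 1234 is skipped entirely, since no part of it is both preceded and followed by a non-digit.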
The method you used finds substrings of 0 to 3 digits, so it cannot meet your expectation.
My solution:
>>> import re
>>> re.findall(r'\d','ds1hg2jh4jh5')
['1', '2', '4', '5']
>>> res = re.findall(r'\d','ds1hg2jh4jh5')
>>> len(res)
4
>>> res = re.findall(r'\d','23425')
>>> len(res)
5
So next you just need to use an if statement to check the number of digits.
There could be a couple reasons:
Since you want \d to search for digits or numbers, you should probably spell that as "\\d" or r"\d". "\d" might happen to work, but only because d isn't special (yet) in a string. "\n" or "\f" or "\r" will do something totally different. Check out the re module documentation and search for "raw strings".
"\\d{0,3}" will match just about anything, because {0,3} means "between zero and three repetitions". So it will match at the start of any string, since any string starts with the empty string.
Or perhaps you want to be searching for strings that are only zero to three digits, and nothing else. In this case, you want to use something like r"^\d{0,3}$". The reason is that regular expressions match anywhere in a string (or only at the beginning if you are using re.match and not re.search). ^ matches the start of the string, and $ matches the end, so by putting those at each end you are not matching anything that has anything before or after \d{0,3}.
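The two readings of the question can be sketched as two small checks. The function names are made up for this example, and re.fullmatch is used here in place of the ^...$ anchors from the last point as one way to require that the whole string matches.

```python
import re

def at_most_three_digits(s):
    # Hypothetical helper: True if s consists of zero to three digits only.
    return re.fullmatch(r'\d{0,3}', s) is not None

def has_long_digit_run(s):
    # Hypothetical helper: True if s contains four or more digits in a row.
    return re.search(r'\d{4,}', s) is not None

print(at_most_three_digits('123'))    # True
print(at_most_three_digits('1234'))   # False
print(has_long_digit_run('ab1234c'))  # True
print(has_long_digit_run('ab123c4'))  # False
```

One small difference between the two spellings: $ also matches just before a trailing newline, so r"^\d{0,3}$" accepts '12\n' while fullmatch does not.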
