How to split a string in python by certain characters? - python

I am trying to solve a problem with prefix notation, but I am stuck on the part, where I want to split my string into an array:
If I have the input +22 2 I want to get the array to look like this:['+', '22', '2']
I tried using the
import re
function, but I am not sure how it works.
I tried the
word.split(' ')
method, but it only helps with the spaces.. any ideas?
P.S:
In the prefix notation I will also have + - and *.
So I need to split the string so the space is not in the array, and +, -, * is in the array
I am thinking of
word = input()
array = word.split(' ')
Then after that I am thinking of splitting a string by these 3 characters.
Sample input:
'+-12 23*67 1'
Output:
['+', '-', '12', '23', '*', '67', '1']

You can use the re module to find patterns in text. It seems you are looking for either one of the operators +, - and * or a group of digits, so compile a pattern that matches either of those and collect every match with findall, which returns a list:
import re
pattern = re.compile(r'([-+*]|\d+)')
string = '+-12 23*67 1'
array = pattern.findall(string)
print(array)
# Output:
# ['+', '-', '12', '23', '*', '67', '1']
Also a bit of testing (comparing your sample strings with the expected output):
test_cases = {
    '+-12 23*67 1': ['+', '-', '12', '23', '*', '67', '1'],
    '+22 2': ['+', '22', '2']
}
for string, correct in test_cases.items():
    assert pattern.findall(string) == correct
print('Tests completed successfully!')
Pattern explanation (you can read about this in the docs linked below):
r'([-+*]|\d+)'
r in front makes it a raw string, so Python passes backslashes through to the regex engine unchanged; this helps with escape sequences in the pattern because a single backslash is enough
(...) parentheses indicate a group which can later be retrieved if needed (they are not necessary in this case, because the group wraps the whole pattern)
[...] is a character class: it matches any single character listed inside, so here it matches one of -, + or * (the - is listed first so that it is taken literally instead of forming a range)
| is alternation (logical or), meaning either side can match (here, an operator or a number)
\d is the special sequence for a digit; the + after it means one or more, so \d+ matches a whole run of digits
Useful:
re module, the docs there explain what each character in the pattern does
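As a small sketch of the note about the parentheses: dropping the group should not change what findall returns here, since the group wraps the entire pattern. The helper name tokenize is made up for this illustration and is not part of the answer above.

```python
import re

# Same pattern as above, without the (optional) capturing group.
TOKEN = re.compile(r'[-+*]|\d+')

def tokenize(expression):
    """Return operators and digit groups in order; spaces are skipped."""
    return TOKEN.findall(expression)

print(tokenize('+-12 23*67 1'))  # ['+', '-', '12', '23', '*', '67', '1']
print(tokenize('+22 2'))         # ['+', '22', '2']
```

Note that any character that is neither a digit nor one of the three operators is silently skipped, which may or may not be what you want for invalid input.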

Related

Regex matching for value between two patterns doesn't return all instances when the values are consecutive

I am trying to find all instance of a number within an equation. And for that, I wrote this python script:
re.findall(fr"([\-\+\*\/\(]|^)({val})([\-\+\*\/\)]|$)", equation)
Now, when I give it this: 20+5-20, and search for 20, the output is as expected: [('', '20', '+'), ('-', '20', '')]
But, when I simply do 20+20-5, it doesn't work anymore and I only get the first instance: [('', '20', '+')]
I don't understand why, it's not even a problem of 20 being at start and end, for example, this 5-20*4-20/3 will still match 20 very well. It just doesn't work when the value is repeated consecutively
how do I fix this?
Thank you
The reason your pattern does not work for 20+20-5 is that, after matching the first occurrence of 20, the trailing character class actually consumes the +.
Once it is consumed, the part ([\-\+\*\/\(]|^) cannot match before the second occurrence of 20: there is no character left for the character class to match, and the position is not at the start of the string.
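To make the consumption visible, here is a minimal sketch that prints the full text and span of every match of the question's pattern. The helper name consumed_spans is hypothetical; 20 is hard-coded in place of {val}.

```python
import re

# The pattern from the question, with 20 in place of {val}.
PATTERN = re.compile(r"([\-\+\*\/\(]|^)(20)([\-\+\*\/\)]|$)")

def consumed_spans(equation):
    """Return (consumed text, span) per match; group(0) includes the operators."""
    return [(m.group(0), m.span()) for m in PATTERN.finditer(equation)]

print(consumed_spans("20+5-20"))  # [('20+', (0, 3)), ('-20', (4, 7))]
print(consumed_spans("20+20-5"))  # [('20+', (0, 3))]
```

In the second case the first match already covers the +, so the search resumes at the second 20 with no operator left in front of it.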
Taking 20 as an example in place of {val}, you can use lookarounds, which do not consume characters but only assert that they are present.
Note that you don't have to escape the characters in the character class, and for the last assertion you don't need another non-capturing group.
(?:(?<=[-+*/(])|^)20(?=[-+*/)]|$)
import re
strings = [
    "20+5-20",
    "20+20-5"
]
val = 20
pattern = fr"(?:(?<=[-+*/(])|^){val}(?=[-+*/)]|$)"
for equation in strings:
    print(re.findall(pattern, equation))
Output
['20', '20']
['20', '20']
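One caveat, sketched under the assumption that val may not always be a plain integer: if the value contains characters that are special in a regex, such as the dot in 3.20, interpolating it directly lets that dot match any character. Passing it through re.escape avoids this. The helper name find_value is hypothetical.

```python
import re

def find_value(val, equation):
    """Find val where it stands as a whole operand between operators."""
    escaped = re.escape(str(val))  # the dot in '3.20' becomes a literal dot
    pattern = rf"(?:(?<=[-+*/(])|^){escaped}(?=[-+*/)]|$)"
    return re.findall(pattern, equation)

print(find_value(20, "20+20-5"))        # ['20', '20']
print(find_value("3.20", "3x20+3.20"))  # ['3.20']
```

Without the escape, the second call would also pick up 3x20 at the start of the string.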
I suggest just searching for all numbers (integer + decimal) in your expression, and then filtering for certain values:
inp = "20+5-20*3.20"
matches = re.findall(r'\d+(?:\.\d+)?', inp)
matches = [x for x in matches if x == '20']
print(matches) # ['20', '20']
Every number in your formula should only be surrounded by either arithmetic symbols, parentheses, or whitespace, all of which are non word characters.
I think I found an answer, though I'm still not sure how correct it is or why it works when mine doesn't :/
re.findall(fr"(?:(?<=[\=\-\+\*\/\(])|^)({val})(?:(?=[\=\-\+\*\/\)])|$)", equation)
Basically, it uses a lookbehind and a lookahead to check that the value sits between operators.

Regular expression to extract numbers around a slash "/"

I tried to extract numbers in a format like "**/*,**/*".
For example, for string "272/3,65/5", I want to get a list = [272,3,65,5]
I tried to use \d*/ and /\d* to extract 272,65 and 3,5 separately, but I want to know if there's a method to get a list I showed above directly. Thanks!
You can use [0-9] to represent a single decimal digit, so a better way of writing the regex would be something along the lines of [0-9]*/?[0-9]+,[0-9]+/?[0-9]*. I'm assuming you are trying to match fractional points such as 5,2/3, 23242/234,23, or 0,0.
Breakdown of the main components in [0-9]*/?[0-9]+,[0-9]+/?[0-9]*:
[0-9]* matches 0 or more digits
/? optionally matches a '/'
[0-9]+ matches at least one digit
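As a rough sketch of how that pattern could be used, the snippet below checks whether a whole string fits the format. The helper name looks_valid is an assumption for illustration; note that this only validates the shape and does not extract the numbers.

```python
import re

# The pattern from the breakdown above, compiled once.
FORMAT = re.compile(r'[0-9]*/?[0-9]+,[0-9]+/?[0-9]*')

def looks_valid(text):
    """True if the whole string fits the pattern (fullmatch anchors both ends)."""
    return FORMAT.fullmatch(text) is not None

for sample in ("272/3,65/5", "5,2/3", "0,0", "272/,65"):
    print(sample, looks_valid(sample))
```

The first three samples fit the pattern; the last one does not, because no digit follows the slash.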
My favorite tool for debugging regexes is https://regex101.com/ since it explains what each of the operators means and shows you how the regex is performing on a sample of your choice as you write it.
# import the regular expression module
import re
# input string
s = "272/3,65/5"
# findall returns every non-overlapping match of the pattern
result = re.findall(r'\d+', s)
print(result) # ['272', '3', '65', '5']
Here \d matches a single digit in the input.
If we use \d on its own, each digit is returned separately:
result = re.findall(r'\d', s)
print(result) # ['2', '7', '2', '3', '6', '5', '5']
So we should use \d+, where + extends each match over as many consecutive digits as possible and stops at the first non-digit.
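Since the question asks for a list of numbers rather than strings, a minimal follow-up sketch is to convert each match with int:

```python
import re

s = "272/3,65/5"
# findall returns strings; convert each one to int to get a list of numbers.
numbers = [int(token) for token in re.findall(r'\d+', s)]
print(numbers)  # [272, 3, 65, 5]
```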

Get Regex Result as group Python

I am trying to find an expression, but I want the value as a whole, not separate values:
test = "code.coding = 'DS-2002433E172062D';"
find = re.compile(r'[A-Z,\d]')
hey = re.findall(find, str(test))
print(hey)
response = ['D', 'S', '2', '0', '0', '2', '6', '6', '8', '0', '3', 'E', '1', '7', '2', '0', '6', '2', 'D']
However I want it as DS-2002433E172062D
Your current regex, [A-Z,\d], matches a single character: an uppercase letter A-Z, a comma, or a digit. You want your entire desired string in one match. Try out:
[a-zA-Z0-9]{2}-[a-zA-Z0-9]+
import re
regx = r'[a-zA-Z0-9]{2}-[a-zA-Z0-9]+'
print(re.findall(regx, "code.coding = 'DS-2002433E172062D';"))
Output:
['DS-2002433E172062D']
Explanation:
[ // start of single character match
a-zA-Z0-9 // matches any letter or digit
]{2} // matches 2 times
- // matches the - character
[ // start of single character match
a-zA-Z0-9 // matches any letter or digit
]+ // matches between 1 and unlimited times (greedy)
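If only one value is expected, a possible variation (a sketch, not part of the answer above) is to use re.search, which returns a single match object instead of a list, so the value comes back as one string:

```python
import re

test = "code.coding = 'DS-2002433E172062D';"
match = re.search(r'[a-zA-Z0-9]{2}-[a-zA-Z0-9]+', test)
# search returns None when nothing matches, so guard before calling group().
code = match.group() if match else None
print(code)  # DS-2002433E172062D
```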
As an improvement on the previous answer I often use a utility function like this:
Python Function:
def extractRegexPattern(theInputPattern, theInputString):
    """
    Extracts the given regex pattern.
    Param theInputPattern - the pattern to extract.
    Param theInputString - a String to extract from.
    """
    import re
    sourceStr = str(theInputString)
    prog = re.compile(theInputPattern)
    theList = prog.findall(sourceStr)
    return theList
Use like so:
test = "code.coding = 'DS-2002433E172062D';"
groups = extractRegexPattern(r"([A-Z,\d]+)+", test)
print(groups)
Outputs:
['DS', '2002433E172062D']
Regex needs () to make it a group:
You had:
'[A-Z,\d]'
Which means: any letter in the range A-Z or "," (comma) or the pattern \d (digits) as a single match ... and this yielded all of the matches separately (and subtly omits the dash which would have needed [-A-Z,\d] to be included).
For groups of the same type of matches try:
'([A-Z,\d]+)+'
Which means: any letter in the range A-Z or "," (comma) or the pattern \d (digits) one or more times per group match and one or more groups per input ... and this yields all of the matches as groups (and clearly omits the dash, which would have needed ([-A-Z,\d]+)+ to be included).
The point: use () (parentheses) to denote groups in regular expression.
Lastly, while testing your regex it may be helpful to use a debugger like http://pythex.org until you get the hang of the pattern at hand.
Edit:
Sorry, I missed the part about how you wanted it as DS-2002433E172062D
For that result using groups try:
'([-A-Z,\d]+)+'
Which means: a - (dash) or any letter in the range A-Z or , (comma) or the pattern \d (digits) one or more times per group match and one or more groups per input ... and this yields all of the matches as groups. Note that a literal - (dash) should be placed first or last inside the brackets (or be escaped); elsewhere it is read as a range operator. In my experience it is clearest to put the dash at the very beginning of the pattern.
Hope this helps
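A quick sketch of the dash placement point: the three character classes below differ only in where the literal dash sits, and each should pick up the whole value. The comma is left out of the classes here for brevity, which is a simplification on my part.

```python
import re

test = "code.coding = 'DS-2002433E172062D';"

# Three ways to include a literal dash in a character class.
variants = {
    "dash first": r'[-A-Z\d]+',
    "dash last": r'[A-Z\d-]+',
    "dash escaped": r'[A-Z\-\d]+',
}
results = {name: re.findall(p, test) for name, p in variants.items()}
for name, found in results.items():
    print(name, found)
```

Each variant prints ['DS-2002433E172062D'].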
You are very close; just change your regex pattern:
pattern=r'[A-Z,\d]'
to
pattern=r'[A-Z,\d-]+'
full code:
import re
test = "code.coding = 'DS-2002433E172062D';"
pattern=r'[A-Z,\d-]+'
print(re.findall(pattern,test))
output:
['DS-2002433E172062D']

Match a specific number of digits not preceded or followed by digits

I have a string:
string = u'11a2ee22b333c44d5e66e777e8888'
I want to find all k consecutive chunks of digits where n <= k <= m.
Using regular expression only:
say for example n=2 and m=3
using (?:\D|^)(\d{2,3})(?:\D|$)
re.findall(u'(?:\D|^)(\d{2,3})(?:\D|$)',u'11a2ee22b333c44d5e66e777e8888')
Gives this output:
['11', '333', '66']
Desired output:
['11', '22', '333', '44', '66', '777']
I know there are alternate solutions like:
filter(lambda x: re.match('^\d{2,3}$', x), re.split(u'\D',r'11a2ee22b333c44d5e66e777e8888'))
which gives the desired output, but I want to know what's wrong with the first approach?
It seems re.findall goes in sequence and skips the previous part when matched, so what can be done?
Note: The result you show in your question is not what I'm getting:
>>> import re
>>> re.findall(u'(?:\D|^)(\d{2,3})(?:\D|$)',u'11a2ee22b333c44d5e66e777e8888')
[u'11', u'22', u'44', u'66']
It's still missing some of the matches you want, but not the same ones.
The problem is that even though non-capturing groups like (?:\D|^) and (?:\D|$) don't capture what they match, they still consume it.
This means that the match which yields '22' has actually consumed:
e, with (?:\D|^) – not captured (but still consumed)
22 with (\d{2,3}) – captured
b with (?:\D|$) – not captured (but still consumed)
… so that b is no longer available to be matched before 333.
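The consumption can be made visible with a short sketch that prints, for each match, the full consumed text next to the captured digits:

```python
import re

string = '11a2ee22b333c44d5e66e777e8888'
pattern = r'(?:\D|^)(\d{2,3})(?:\D|$)'

# group(0) is the full consumed text; group(1) is only the captured digits.
consumed = [(m.group(0), m.group(1)) for m in re.finditer(pattern, string)]
print(consumed)
# [('11a', '11'), ('e22b', '22'), ('c44d', '44'), ('e66e', '66')]
```

Each match swallows the non-digit on both sides, so the separator needed in front of 333 and 777 is already gone by the time the search resumes.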
You can get the result you want with lookbehind and lookahead syntax:
>>> re.findall(u'(?<!\d)\d{2,3}(?!\d)',u'11a2ee22b333c44d5e66e777e8888')
[u'11', u'22', u'333', u'44', u'66', u'777']
Here, (?<!\d) is a negative lookbehind, checking that the match is not preceded by a digit, and (?!\d) is a negative lookahead, checking that the match is not followed by a digit. Crucially, these constructions do not consume any of the string.
The various lookahead and lookbehind constructions are described in the
Regular Expression Syntax section of Python's re documentation.
A lookahead also works here: \d{2,3} means 2 or 3 digits, and (?=[a-z]) requires a letter right after the digits without consuming it.
In [136]: re.findall(r'(\d{2,3})(?=[a-z])',string)
Out[136]: ['11', '22', '333', '44', '66', '777']
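One thing to watch with the lookahead-only version, shown as a sketch on a made-up input: nothing prevents a match from starting in the middle of a longer run of digits, and a run at the very end of the string has no letter after it.

```python
import re

text = '12345a67'

# Lookahead only: the match may start inside a longer run of digits.
letter_after = re.findall(r'\d{2,3}(?=[a-z])', text)

# Negative lookarounds on both sides: the run itself must be 2 or 3 digits long.
bounded = re.findall(r'(?<!\d)\d{2,3}(?!\d)', text)

print(letter_after)  # ['345']
print(bounded)       # ['67']
```

For the sample string in the question both approaches happen to give the same list.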
You could even generalize it with a function:
import re
string = "11a2ee22b333c44d5e66e777e8888"
def numbers(n,m):
    rx = re.compile(r'(?<!\d)(\d{' + '{},{}'.format(n,m) + '})(?!\d)')
    return rx.findall(string)
print(numbers(2,3))
# ['11', '22', '333', '44', '66', '777']

Removing leading digits from string using Python?

I see many questions asking how to remove leading zeroes from a string, but I have not seen any that ask how to remove any and all leading digits from a string.
I have been experimenting with combinations of lstrip, type function, isdigit, slice notation and regex without yet finding a method.
Is there a simple way to do this?
For example:
"123ABCDEF" should become "ABCDEF"
"ABCDEF" should stay as "ABCDEF"
"1A1" should become "A1"
"12AB3456CD" should become "AB3456CD"
A simple way is to use string.digits, which is simply the string '0123456789', as the set of characters to remove with str.lstrip.
>>> from string import digits
>>> s = '123dog12'
>>> s.lstrip(digits)
'dog12'
I want to point out that though both Mitch's and RebelWithoutAPulse's answers are correct, they do not do the same thing.
Mitch's answer left-strips any characters in the set '1', '2', '3', '4', '5', '6', '7', '8', '9', '0'.
>>> from string import digits
>>> digits
'0123456789'
>>> '123dog12'.lstrip(digits)
'dog12'
RebelWithoutAPulse's answer, on the other hand, left-strips any character known to be a digit.
>>> import re
>>> re.sub('^\d+', '', '123dog12')
'dog12'
So what's the difference? Well, there are two differences:
There are many other digit characters than the indo-arabic numerals.
lstrip is ambiguous on RTL languages. Actually, it removes leading matching characters, which may be on the right side. Regexp's ^ operator is more straightforward about it.
Here are a few examples:
>>> '١٩٨٤فوبار٤٢'.lstrip(digits)
'١٩٨٤فوبار٤٢'
>>> re.sub('^\d+', '', '١٩٨٤فوبار٤٢')
'فوبار٤٢'
>>> '𝟏𝟗𝟖𝟒foobar𝟒𝟐'.lstrip(digits)
'𝟏𝟗𝟖𝟒foobar𝟒𝟐'
>>> re.sub('^\d+', '', '𝟏𝟗𝟖𝟒foobar𝟒𝟐')
'foobar𝟒𝟐'
(note for the Arabic example, Arabic being read from right to left, it is correct for the number on the right to be removed)
So… I guess the conclusion is be sure to pick the right solution depending on what you're trying to do.
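If you want the regex approach but only for the ASCII digits 0-9, two options I am aware of are an explicit [0-9] class or the re.ASCII flag. A minimal sketch, using escape sequences for the Arabic-Indic digits to keep the source unambiguous:

```python
import re

# Four Arabic-Indic digits (U+0661, U+0669, U+0668, U+0664) followed by ASCII text.
s = '\u0661\u0669\u0668\u0664abc'

unicode_aware = re.sub(r'^\d+', '', s)               # \d matches any Unicode decimal digit
ascii_class = re.sub(r'^[0-9]+', '', s)              # explicit ASCII range
ascii_flag = re.sub(r'^\d+', '', s, flags=re.ASCII)  # \d restricted to [0-9]

print(unicode_aware == 'abc', ascii_class == s, ascii_flag == s)  # True True True
```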
Using regexes from re:
import re
re.sub(r'^\d+', '', '1234AB456')
Becomes:
'AB456'
This replaces one or more digits at the beginning of the string with an empty string.
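As a quick check, here is a sketch that runs the four examples from the question through both approaches (lstrip with string.digits and the regex):

```python
import re
from string import digits

# Inputs and expected results taken from the question.
examples = {
    "123ABCDEF": "ABCDEF",
    "ABCDEF": "ABCDEF",
    "1A1": "A1",
    "12AB3456CD": "AB3456CD",
}
for text, expected in examples.items():
    assert text.lstrip(digits) == expected
    assert re.sub(r'^\d+', '', text) == expected
print('All examples pass')
```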
