I am trying to capture something along the lines of
1/2x1 + 3x2 - 4/5x3
I will strip the spaces beforehand, so the regular expression does not need to handle them. The problem is that I want the leading coefficient to optionally be a fraction: if I see a / then it must be followed by \d+. I don't necessarily care to capture the /.
Ideally I would extract the groups as such:
# first match
match.groups(1)
('1', '2', 'x1')
#second match
('+', '3', 'x2')
#third match
('-', '4', '5', 'x3')
Something that is (sort of) working is ([+-])?(\d)+(\/\d)?([a-zA-Z]+\d+). However, I don't love that it also captures the preceding '/'.
Example output:
>>> regexp = re.compile('([+-])?(\d)+(\/\d)?([a-zA-Z]+\d+)')
>>> expr = '1/2a3+1/8x2-4x3'
>>> match = regexp.search(expr)
>>> match.groups(1)
(1, '1', '/2', 'a3')
>>> expr = expr.replace(match.group(0), '')
>>> match = regexp.search(expr)
>>> match.groups(1)
('+', '1', '/8', 'x2')
>>> expr = expr.replace(match.group(0), '')
>>> match = regexp.search(expr)
>>> match.groups(1)
('-', '4', 1, 'x3')
In the first match, what does the first element 1 mean? I see the same thing in the third match, third element. In both of these - that particular "group" is missing. So is that just a way of being like "I matched, but I didn't match anything"?
Another issue with the above regex is that it makes the [+-] optional. I want it to be optional on the first term, but mandatory on subsequent terms.
Anyways the above is usable, I'll need to peel off the /, and I can sanitize the input to ensure the +- are always there, but it's not as elegant as I'm sure it can be.
Thanks for any help
You could rework your regex slightly to use capturing groups only for things you want to capture and then use re.findall to extract all matches at once:
regexp = re.compile(r'([+-])?(\d+)(?:/(\d+))?([a-zA-Z]+\d+)')
res = regexp.findall(expr)
Output:
[
('', '1', '2', 'a3'),
('+', '1', '8', 'x2'),
('-', '4', '', 'x3')
]
Note that when there is no fraction (or no sign on the first term), the tuple contains empty strings (''). If required, you could filter those out, e.g.
[tuple(filter(lambda x:x, tup)) for tup in res]
# [('1', '2', 'a3'), ('+', '1', '8', 'x2'), ('-', '4', 'x3')]
However, you would then face the difficulty of knowing which value in each tuple corresponds to which part of the expression.
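On the 1 in the question's output: the argument to match.groups() is the default value reported for groups that did not take part in the match, so match.groups(1) shows 1 where match.groups() would show None. One way around the "which value is which" difficulty is to name the groups. The following is a minimal sketch under that assumption; the group names (sign, num, den, var) and the helper parse_terms are made up for illustration and are not from the question.

```python
import re

# Named groups keep every part identifiable, even when it is absent.
# The names sign/num/den/var are illustrative assumptions.
TERM = re.compile(
    r'(?P<sign>[+-])?(?P<num>\d+)(?:/(?P<den>\d+))?(?P<var>[a-zA-Z]+\d+)'
)

def parse_terms(expr):
    """Return one dict per term; parts that are missing are None."""
    return [m.groupdict() for m in TERM.finditer(expr)]

terms = parse_terms('1/2a3+1/8x2-4x3')
for term in terms:
    print(term)

# groups() reports None for unmatched groups; its argument is only a default.
m = TERM.search('4x3')
print(m.groups())    # (None, '4', None, 'x3')
print(m.groups(1))   # (1, '4', 1, 'x3')
```

Because absent parts come back as None under a fixed key, nothing has to be filtered out and the position of each value no longer matters.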
Note that the final two digits of a pattern such as FBXASC048 are meant to be the ASCII code of a digit (0-9).
Input example list: ['FBXASC048009Car', 'FBXASC053002Toy', 'FBXASC050004Human']
Result example: ['0009Car', '5002Toy', '2004Human']
What is the proper way to search an input list for any of these patterns
num_ascii = ['FBXASC048', 'FBXASC049', 'FBXASC050', 'FBXASC051', 'FBXASC052', 'FBXASC053', 'FBXASC054', 'FBXASC055', 'FBXASC056', 'FBXASC057']
and then replace each pattern found with the corresponding item in conv_list? The replacement is not random,
because each element in the pattern list maps to exactly one element in conv_list:
conv_list = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
This is the solution I have in mind. It has two parts:
1st part: find the ASCII codes [48, 49, 50, 51, 52, 53, 54, 55, 56, 57]
and replace them with the matching decimal digits (0-9).
This gives a new list, called input_modi_list, in which the ASCII codes have been replaced with digits.
2nd part: a second pass that uses the replace function to remove the fixed prefix 'FBXASC0':
new_list3 = []
for x in input_modi_list:
    y = x.replace('FBXASC0', '')
    new_list3.append(y)
So new_list3 will hold the combined result of the two parts above.
I don't know whether there is a simpler or better solution, maybe using regex.
Also note that I have no idea how to replace the ASCII codes with decimal digits for a list of items.
I think this should do the trick:
import re
input_list = ['FBXASC048009Car', 'FBXASC053002Toy', 'FBXASC050004Human']
pattern = re.compile(r'FBXASC(\d{3,3})')
def decode(match):
    # Turn the three captured digits into the character with that ASCII code
    return chr(int(match.group(1)))
result = [re.sub(pattern, decode, item) for item in input_list]
print(result)
# ['0009Car', '5002Toy', '2004Human']
Some explanation is due:
1- The pattern object is a regular expression that matches any part of a string that starts with 'FBXASC' and ends with 3 digits (0-9). (\d means a digit, and {3,3} means it should occur at least 3 and at most 3 times, i.e. exactly 3 times; {3} is equivalent.) The parentheses around \d{3,3} mean that the three matched digits are stored for later use (explained in the next part).
2- The decode function receives a match object, uses .group(1) to extract the first matched group (which in our case is the three digits matched by \d{3,3}), then uses the int function to parse the string into an integer (for example, converting '048' to 48), and finally uses the chr function to find which character has that ASCII code (for example, chr(48) returns '0' and chr(65) returns 'A').
3- The final part applies the re.sub function to every element of the list, replacing each occurrence of the pattern (FBXASC followed by 3 digits) with its corresponding ASCII character.
You can see that this solution is not limited only to your specific examples. Any number can be used as long as it has a corresponding ASCII character recognized by the chr function.
But, if you do want to limit it just to the 48-57 range, you can simply modify the decode function:
def decode(match):
    ascii_code = int(match.group(1))
    if 48 <= ascii_code <= 57:
        return chr(ascii_code)
    else:
        return match.group(0)  # returns the entire matched string - no modification
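As an alternative to checking the range inside decode, the range can be built into the pattern itself, so that codes outside 048-057 never match in the first place. A minimal sketch under that assumption; the names DIGIT_CODE and decode_digits are made up for illustration:

```python
import re

# Matches only FBXASC048 .. FBXASC057; the two-digit code is captured.
DIGIT_CODE = re.compile(r'FBXASC0(4[89]|5[0-7])')

def decode_digits(items):
    """Replace each FBXASC0NN code (NN in 48..57) with the digit chr(NN)."""
    return [DIGIT_CODE.sub(lambda m: chr(int(m.group(1))), item)
            for item in items]

decoded = decode_digits(
    ['FBXASC048009Car', 'FBXASC053002Toy', 'FBXASC050004Human', 'FBXASC065x']
)
print(decoded)
# ['0009Car', '5002Toy', '2004Human', 'FBXASC065x']
```

The last item is left untouched because 065 is outside the digit range.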
This is how I would do it.
Make the regex pattern by simply joining the strings with |:
>>> num_ascii = ['FBXASC048', 'FBXASC049', 'FBXASC050', 'FBXASC051', 'FBXASC052', 'FBXASC053', 'FBXASC054', 'FBXASC055', 'FBXASC056', 'FBXASC057']
>>> conv_list = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
>>> regex_pattern = '|'.join(num_ascii)
>>> regex_pattern
'FBXASC048|FBXASC049|FBXASC050|FBXASC051|FBXASC052|FBXASC053|FBXASC054|FBXASC055|FBXASC056|FBXASC057'
Make a look-up dictionary by simply zipping the two lists:
>>> conv_table = dict(zip(num_ascii, conv_list))
>>> conv_table
{'FBXASC048': '0', 'FBXASC049': '1', 'FBXASC050': '2', 'FBXASC051': '3', 'FBXASC052': '4', 'FBXASC053': '5', 'FBXASC054': '6', 'FBXASC055': '7', 'FBXASC056': '8', 'FBXASC057': '9'}
Iterate over the data and replace the matched string with the corresponding digit:
>>> import re
>>> result = []
>>> for item in ['FBXASC048009Car', 'FBXASC053002Toy', 'FBXASC050004Human']:
...     m = re.match(regex_pattern, item)
...     matched_string = m[0]
...     digit = conv_table[matched_string]
...     print(f'replacing {matched_string} with {digit}')
...     result.append(item.replace(matched_string, digit))
...
replacing FBXASC048 with 0
replacing FBXASC053 with 5
replacing FBXASC050 with 2
>>> result
['0009Car', '5002Toy', '2004Human']
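The loop above relies on every item starting with one of the patterns, because re.match returns None otherwise. A minimal sketch that keeps the same look-up table but lets re.sub do the replacement wherever a pattern occurs, leaving other items untouched (the helper name replace_codes is an assumption for illustration):

```python
import re

# Same data as above, built programmatically: 'FBXASC048' .. 'FBXASC057'.
num_ascii = ['FBXASC%03d' % code for code in range(48, 58)]
conv_list = [str(d) for d in range(10)]
conv_table = dict(zip(num_ascii, conv_list))

# re.escape is not strictly needed for these strings, but is a safe habit
# when joining literal strings into an alternation.
regex = re.compile('|'.join(map(re.escape, num_ascii)))

def replace_codes(item):
    """Replace every known code in item using the look-up table."""
    return regex.sub(lambda m: conv_table[m.group(0)], item)

items = ['FBXASC048009Car', 'FBXASC053002Toy', 'FBXASC050004Human', 'NoCodeHere']
result = [replace_codes(item) for item in items]
print(result)
# ['0009Car', '5002Toy', '2004Human', 'NoCodeHere']
```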
For example, I get the following input:
-9x+5x-2-4x+5
And I need to get the following list:
['-9x', '5x', '-2', '-4x', '5']
Here is my code, but I don't know how to deal with minuses.
import re
text = '-3x-5x+2=9x-9'
text = re.split(r'\W', text)
print(text)
warning: I cannot use any libraries except re and math.
You could re.findall all groups of characters followed by + or - (or end-of-string $), then strip the + (which, like -, is still part of the following group) from the substrings.
>>> s = "-9x+5x-2-4x+5"
>>> [x.strip("+") for x in re.findall(r".+?(?=[+-]|$)", s)]
['-9x', '5x', '-2', '-4x', '5']
Similarly, for the second string with =, add that to the character group and also strip it off the substrings:
>>> s = '-3x-5x+2=9x-9'
>>> [x.strip("+=") for x in re.findall(r".+?(?=[+=-]|$)", s)]
['-3x', '-5x', '2', '9x', '-9']
Or apply the original comprehension to the substrings after splitting by =, depending on what the result should look like:
>>> [[x.strip("+") for x in re.findall(r".+?(?=[+-]|$)", s2)] for s2 in s.split("=")]
[['-3x', '-5x', '2'], ['9x', '-9']]
In fact, now that I think of it, you can also just use findall with a pattern that matches an optional minus, followed by some digits and an optional x, with or without splitting by = first:
>>> [re.findall(r"-?\d+x?", s2) for s2 in s.split("=")]
[['-3x', '-5x', '2'], ['9x', '-9']]
One of many possible ways:
import re
term = "-9x+5x-2-4x+5"
rx = re.compile(r'-?\d+[a-z]?')
factors = rx.findall(term)
print(factors)
This yields
['-9x', '5x', '-2', '-4x', '5']
For your example data, you might split on either a plus or equals sign or split when asserting a minus sign on the right which is not at the start of the string.
[+=]|(?=(?<!^)-)
[+=] Match either + or =
| Or
(?=(?<!^)-) Positive lookahead, assert what is on the right is - but not at the start of the string
Output for both example strings
['-9x', '5x', '-2', '-4x', '5']
['-3x', '-5x', '2', '9x', '-9']
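A minimal sketch of how the pattern above could be used with re.split. It assumes Python 3.7 or later, where re.split also splits on empty matches such as the lookahead branch; the names SPLIT_RX and split_terms are made up for illustration.

```python
import re

# Split on '+' or '=', or at the empty position just before a '-'
# that is not at the start of the string (needs Python 3.7+).
SPLIT_RX = re.compile(r'[+=]|(?=(?<!^)-)')

def split_terms(expr):
    """Split an expression into signed terms; minus signs are kept."""
    return SPLIT_RX.split(expr)

first = split_terms('-9x+5x-2-4x+5')
second = split_terms('-3x-5x+2=9x-9')
print(first)
print(second)
```

Since the lookahead consumes nothing, each minus sign stays attached to the term that follows it, while + and = are consumed as delimiters.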
I would like to split a string into sections of numbers and sections of text/symbols.
My current code doesn't handle negative numbers or decimals, and it behaves weirdly, adding an empty list element at the end of the output.
import re
mystring = 'AD%5(6ag 0.33--9.5'
newlist = re.split('([0-9]+)', mystring)
print (newlist)
current output:
['AD%', '5', '(', '6', 'ag ', '0', '.', '33', '--', '9', '.', '5', '']
desired output:
['AD%', '5', '(', '6', 'ag ', '0.33', '-', '-9.5']
Your issue is that the regex captures only runs of one or more digits, so '.' and '-' are never treated as part of a number. Because the digits act as the delimiter, re.split returns the text before and after each match; when the string ends with a match, the text after it is the empty string, which is why '' is added at the end of the resulting list.
You may split with a regex that matches float or integer numbers with an optional minus sign and then remove empty values:
result = re.split(r'(-?\d*\.?\d+)', mystring)
result = list(filter(None, result))  # drop empty strings; list() is needed on Python 3 to get a list
To match negative/positive numbers with exponents, use
r'([+-]?\d*\.?\d+(?:[eE][-+]?\d+)?)'
The -?\d*\.?\d+ regex matches:
-? - an optional minus
\d* - 0+ digits
\.? - an optional literal dot
\d+ - one or more digits.
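Putting the pieces above together, a minimal sketch run on the string from the question; the helper name split_numbers and the second sample string are assumptions for illustration:

```python
import re

NUMBER = re.compile(r'(-?\d*\.?\d+)')
# Variant from above that also accepts a sign and an exponent.
NUMBER_EXP = re.compile(r'([+-]?\d*\.?\d+(?:[eE][-+]?\d+)?)')

def split_numbers(text, pattern=NUMBER):
    """Split text into number and non-number parts, dropping empty strings."""
    return [part for part in pattern.split(text) if part]

plain = split_numbers('AD%5(6ag 0.33--9.5')
print(plain)
# ['AD%', '5', '(', '6', 'ag ', '0.33', '-', '-9.5']

with_exp = split_numbers('x=1.5e-3y', NUMBER_EXP)
print(with_exp)
# ['x=', '1.5e-3', 'y']
```

In '--9.5' the first minus cannot start a number, because no digit follows it directly, so it ends up as a separate '-' part and the second minus is taken as the sign of 9.5.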
Unfortunately, re.split() does not offer an "ignore empty strings" option. However, to retrieve your numbers, you could easily use re.findall() with a different pattern:
import re
string = "AD%5(6ag0.33-9.5"
rx = re.compile(r'-?\d+(?:\.\d+)?')
numbers = rx.findall(string)
print(numbers)
# ['5', '6', '0.33', '-9.5']
As mentioned here before, there is no option to ignore the empty strings in re.split() but you can easily construct a new list the following way:
import re
mystring = "AD%5(6ag0.33--9.5"
newlist = [x for x in re.split(r'(-?\d+\.?\d*)', mystring) if x != '']
print(newlist)
output:
['AD%', '5', '(', '6', 'ag', '0.33', '-', '-9.5']
I have to match an expression similar to these
STAR 13
STAR 13, 23
STAR 1, 2 and 3 and STAR 1
But only capture the digits.
The number of digits is unspecified.
I've tried with STAR(?:\s*(?:,|and)\s*(#\d+))+
But it doesn't seem to capture the terms exactly.
No other dependencies could be added. Just the re module only.
The problem is a much larger one where STAR is another regular expression which has already been solved. Please don't bother about it and just consider it as a letter combination. Just include the letters STAR in regular expressions.
If you don't know how many digits there are, use r'[0-9]+' to match one or more digits. To capture every number, you can use r'(\d+)'.
Do it with one regex (the three results below are for the three example strings, each passed in turn as a):
re.findall("STAR ([0-9]+),? ?([0-9]+)? ?a?n?d? ?([0-9]+)?",a)
[('13', '', '')]
[('13', '23', '')]
[('1', '2', '3'), ('1', '', '')]
It may be easier, and give a cleaner result, to do it in two steps. First, you need to have your strings in a list like this:
tab = ["STAR 13","STAR 13, 23","STAR 1, 2 and 3 and STAR 1"]
list_star = filter(lambda x: re.match("^STAR", x), tab)
for i in list_star:
    print(re.findall(r'\d+', i))
['13']
['13', '23']
['1', '2', '3', '1']
After that, you just need to collect the results in a new list inside the loop: my_digit += re.findall(r'\d+', i)
In 1 line:
import functools
import re
tab = ["STAR 13","STAR 13, 23","STAR 1, 2 and 3 and STAR 1"]
digit = functools.reduce(lambda x, y: x + re.findall(r"\d+", y), filter(lambda x: re.match("^STAR ", x), tab), [])
print(digit)
['13', '13', '23', '1', '2', '3', '1']
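A repeated capture group such as (\d+) inside (...)+ only keeps its last repetition, which is probably why the attempt in the question does not capture every number. A minimal two-step sketch under that assumption: first match each whole "STAR n, n and n" list, then pull the digits out of each match. The names STAR_LIST and star_numbers are made up for illustration, and STAR stands in for the real sub-pattern as the question asks.

```python
import re

# One match per STAR list: a first number, then any further numbers
# separated by ',' or 'and'.
STAR_LIST = re.compile(r'STAR\s+\d+(?:\s*(?:,|and)\s*\d+)*')

def star_numbers(text):
    """Return one list of digit strings per STAR list found in text."""
    return [re.findall(r'\d+', m.group(0)) for m in STAR_LIST.finditer(text)]

for sample in ['STAR 13', 'STAR 13, 23', 'STAR 1, 2 and 3 and STAR 1']:
    print(star_numbers(sample))
# [['13']]
# [['13', '23']]
# [['1', '2', '3'], ['1']]
```

Numbers that do not follow a STAR are ignored, because digits are only searched for inside the text matched by STAR_LIST.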
I'm trying to extract numbers and both previous and following characters (excluding digits and whitespaces) of a string. The expected return of the function is a list of tuples, with each tuple having the shape:
(previous_sequence, number, next_sequence)
For example:
string = '200gr T34S'
my_func(string)
>>[('', '200', 'gr'), ('T', '34', 'S')]
My first iteration was to use:
def my_func(string):
    res_obj = re.findall(r'([^\d\s]+)?(\d+)([^\d\s]+)?', string)
    return res_obj
But this function doesn't do what I expect when I pass a string like '2AB3'. I would like the output to be [('','2','AB'), ('AB','3','')], but instead it is [('','2','AB'), ('','3','')], because 'AB' has already been consumed by the previous match.
How could I fix this?
Since the numbers themselves do not overlap, a single trailing lookahead assertion should be all you need.
Something like ([^\d\s]+)?(\d+)(?=([^\d\s]+)?)
Or use ([^\d\s]*)(\d+)(?=([^\d\s]*)) if you care about the difference between NULL and the empty string.
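A minimal sketch of the second pattern above, wrapped in the my_func from the question. The trailing text is captured inside a lookahead, so it is not consumed and is still available as the leading text of the next match. The constant name TRIPLE is an assumption for illustration.

```python
import re

# The third group sits inside a lookahead: it is captured but not consumed.
TRIPLE = re.compile(r'([^\d\s]*)(\d+)(?=([^\d\s]*))')

def my_func(string):
    """Return (previous_sequence, number, next_sequence) tuples."""
    return TRIPLE.findall(string)

print(my_func('200gr T34S'))
# [('', '200', 'gr'), ('T', '34', 'S')]
print(my_func('2AB3'))
# [('', '2', 'AB'), ('AB', '3', '')]
```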
Instead of the modifiers + and ? you can simply use *:
>>> re.findall(r'([^\d\s]*)(\d+)([^\d\s]*)',string)
[('', '200', 'gr'), ('T', '34', 'S')]
But if you mean to match the overlapping strings, you can use a positive lookahead to find all the overlapping matches:
>>> re.findall(r'(?=([^\d\s]*)(\d+)([^\d\s]*))','2AB3')
[('', '2', 'AB'), ('AB', '3', ''), ('B', '3', ''), ('', '3', '')]
Another way can be using regex and functions!
import re

# Example inputs: '200gr T34S', '2AB3'
def s(x):
    tmp = []
    d = re.split(r'\s+|(\d+)', x)
    d = ['' if v is None else v for v in d]  # replace None (from whitespace splits) with ''
    t_ = [i for i in d if len(i) > 0]
    digits = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    nms = [i for i in t_ if i[0] in digits]
    # Note: d.index(i) finds the first occurrence, so a repeated number
    # reuses the neighbours of its first occurrence.
    for i in nms:
        if d.index(i) == 0:
            tmp.append(('', i, d[d.index(i) + 1]))
        elif d.index(i) == len(d) - 1:
            tmp.append((d[d.index(i) - 1], i, ''))
        else:
            tmp.append((d[d.index(i) - 1], i, d[d.index(i) + 1]))
    return tmp

print(s('2AB3'))
Prints-
[('', '2', 'AB'), ('AB', '3', '')]