I have a string of numbers, some of which may have an incomplete decimal representation,
for example:
a = '1. 1,00,000.00 1 .99 1,000,000.999'
Desired output:
['1','1,00,000.00','1','.99','1,000,000.999']
So far I have tried the following two patterns:
re.findall(r'[-+]?(\d+(?:[.,]\d+)*)',a)
which gives
['1', '1,00,000.00', '1', '99', '1,000,000.999']
which turns .99 into 99, which is not desired,
while
re.findall(r'[-+]?(\d*(?:[.,]\d+)*)',a)
gives
['1', '', '', '1,00,000.00', '', '', '1', '', '.99', '', '1,000,000.999', '']
which also gives undesirable empty strings.
This is for finding currency values in a string, so the comma separators don't follow a set pattern and may not be present at all.
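My suggestion is to use the regex below; I've implemented a snippet in Python.
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
# digits with optional commas and an optional fraction, or a bare fraction such as .99
result = re.findall(r'\d[\d,]*(?:\.\d+)?|\.\d+', a)
print(result)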
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']
You can use re.split:
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
d = re.split(r'(?<=\d)\.\s+|(?<=\d)\s+', a)
print(d)
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']
This regex will give you your desired output:
([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)
You can test it here: https://regex101.com/r/VfQIJC/6
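Note that the pattern above contains capturing groups, so in Python re.findall() would return one tuple per match rather than the matched text. A minimal sketch of one way around that, assuming the input string a from the question, is to iterate with re.finditer() and take group(0):

```python
import re

a = '1. 1,00,000.00 1 .99 1,000,000.999'
pattern = re.compile(r'([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)')

# group(0) is the whole match, regardless of which alternative matched
values = [m.group(0) for m in pattern.finditer(a)]
print(values)
```

Turning the groups into non-capturing ones, (?:...), would make re.findall() usable directly as well.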
Related
I am trying to get all numerical values (integers, decimals, floats, scientific notation) from an expression and want to differentiate them from digits that are not really numbers but part of a name, for example in the expression below.
230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1
The first 230 is not a numerical value, as it is part of a tag (230FIC000.PV).
Using the web tool regexp.com I came up with the following expression, which works for the expression above.
(?!\s)(?<!\w)[+-]?((\d+\.\d*)|(\.\d+)|(\d+))([eE][+-]?\d+)?(\s)|(?<!\w)[0-9]\d+(?<!\s)$
However, when I try to use the above expression in Python with re.findall(), I receive a list of 5 tuples with 6 elements each.
import re
pat = r'(?!\s)(?<!\w)[+-]?((\d+\.\d*)|(\.\d+)|(\d+))([eE][+-]?\d+)?(\s)|(?<!\w)[0-9]\d+(?<!\s)$'
exp = '230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1 '
matches = re.findall(pat,exp)
The result is
0:('2', '', '', '2', 'e3', ' ')
1:('20', '', '', '20', '', ' ')
2:('20.4', '20.4', '', '', '', ' ')
3:('45', '', '', '45', '', ' ')
4:('2', '', '', '2', 'e4', ' ')
len():5
I would like some help to understand what is happening, and whether there is any way to get the same result as on regexp.com.
This should take care of it. (All the items are strings)
import re
st = '230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1'
re.findall(r'-?[0-9]+\.?[0-9]*(?:[Ee]\ *-?\ *[0-9]+)|-?\d+\.\d+|\b\d+\b', st)
References: How to extract numbers from strings, Extracting scientific numbers from string, and Extracting decimal values from string
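As for why the question's attempt returned tuples: when a pattern contains capturing groups, re.findall() returns the groups (one tuple per match) instead of the whole match. The sketch below is a simplified pattern of my own, not the asker's original; it uses only non-capturing groups, plus lookarounds that reject digits glued to a word character or a dot, so re.findall() returns the full numbers:

```python
import re

exp = ('230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 '
       'tic200 >=45 tic100 <-2E-4 fic123 >1')

# Assumed pattern (not from the question): optional sign, a decimal or integer,
# an optional exponent, and no word character or '.' directly before or after.
pat = re.compile(r'(?<![\w.])[+-]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][+-]?\d+)?(?![\w.])')

numbers = pat.findall(exp)
print(numbers)
```

With capturing groups left in place, re.finditer() and m.group(0) would give the same whole-match strings.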
I'm struggling to write a regex that extracts the following numbers in bold below.
I set up 3 different regex for each value, but since the last value might have a space in between I don't know how to accommodate an "AND" here.
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
I have tried this and it is working for the first 2 but not for the last one. I'd like to have the last one in a single regex.
p1 = re.compile(r'(\d+)/')
p2 = re.compile(r'/(\d+)')
p3 = re.compile(r'(?=.*[R](\d+))(?=.*[R]\s(\d+))')
I've tried different things, and this is the last code I tried, without success.
If I do this:
p1.findall(tire), p2.findall(tire), p3.findall(tire)
I would like to see this:
(['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17'])
You were almost there! You don't need three separate regular expressions.
Instead, use multiple capturing groups in a single regex.
(\d{3})\/(\d{2})R\s?(\d{2})
Try it: https://regex101.com/r/Xn6bry/1
Explanation:
(\d{3}): Capture three digits
\/: Match a forward-slash
(\d{2}): Capture two digits
R\s?: Match an R followed by an optional whitespace
(\d{2}): Capture two digits.
In Python, do:
import re
p1 = re.compile(r'(\d{3})\/(\d{2})R\s?(\d{2})')
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
matches = re.findall(p1, tire)
Now if you look at matches, you get
[('275', '65', '18'), ('275', '65', '18'), ('265', '70', '17')]
Rearranging this to the format you want should be pretty straightforward:
# Make an empty list-of-list with three entries - one per group
groups = [[], [], []]
for match in matches:
    for groupnum, item in enumerate(match):
        groups[groupnum].append(item)
Now groups is [['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17']]
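The same rearrangement can also be written as a transpose with zip(). This is only an alternative sketch, reusing the pattern and the tire string from above:

```python
import re

tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
pattern = re.compile(r'(\d{3})/(\d{2})R\s?(\d{2})')

matches = pattern.findall(tire)          # one (width, ratio, rim) tuple per tire
# zip(*matches) regroups the tuples column by column
groups = [list(column) for column in zip(*matches)]
print(groups)
```

One difference to keep in mind: if there are no matches at all, this gives an empty list rather than three empty lists.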
So I have some data I have been trying to clean up. It's a list and it looks like this:
a = [\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain]
I have tried to clean it up by doing this:
a.replace("\n", "|")
The output turns out like this:
[london||18||20||30||||japan||6||80||2|||Spain]
If I do this:
a.replace("\n","")
I get this:
[london,"", "", 18,"","",20"","",30,"","","",""japan,"",""6,"","",80,"","",2"","","","",Spain]
Can anyone explain why I am getting multiple pipes and spaces, and what's the best way to clean the data?
Assuming that your input is:
s = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'
The issue is that there are multiple '\n' in-between data, therefore just replacing each '\n' with another character (say '|') will give you as many of the new characters as there were '\n'.
The simplest approach is to use str.split() to get the non-blank data:
l = list(s.split())
print(l)
# ['london', '18', '20', '30', 'japan', '6', '80', '2', 'Spain']
or, combine it with str.join(), if you want to have it separated by '|':
t = '|'.join(s.split())
print(t)
# london|18|20|30|japan|6|80|2|Spain
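To make the difference concrete, here is a small sketch that compares the one-for-one replacement with one that collapses each run of newlines first. It assumes the same string s as above and uses re.sub(), which is my addition rather than part of the approach above:

```python
import re

s = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'

# One-for-one: every '\n' becomes a '|', so runs of newlines become runs of pipes
one_for_one = s.replace('\n', '|')

# Collapsing: each run of one or more '\n' becomes a single '|',
# then the leading/trailing pipe is stripped
collapsed = re.sub(r'\n+', '|', s).strip('|')
print(collapsed)
```

For whitespace-separated data like this, s.split() remains the simpler choice; the re.sub() variant is mainly useful when the values themselves may contain spaces.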
I tried it and got this:
a = ['\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain']
print(a[0].replace("\n", ""))
Output:
london182030japan6802Spain
Could you please clarify the exact input and the expected output? The input as posted does not seem correct, so I have taken some liberties.
If your input was a string you can use split():
a = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'
print(a.split())
Output:
['london', '18', '20', '30', 'japan', '6', '80', '2', 'Spain']
I am building a file stripper to build a config report, and I have a very very long string as my base data. The following is a very small snippet of it, but it at least illustrates what I'm working with.
Snippet Example: DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&DELVRY_AGGREGATION_INTERVAL1=1&DELVRY_SCHEDULE0=1&DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&
How would I go about matching the following:
between "DEFAULT_GATEWAY=" and "&"
between "DELVRY_AGGREGATION_INTERVAL0=" and "&"
between "DELVRY_AGGREGATION_INTERVAL1=" and "&"
between "DELVRY_SCHEDULE0=" and "&"
between "DELVRY_SNI0=" and "&"
between "DELVRY_USE_SSL_TLS1=" and "&"
and building a dict with it like:
{"DEFAULT_GATEWAY":"192.168.88.1",
"DELVRY_AGGREGATION_INTERVAL0":"1",
"DELVRY_AGGREGATION_INTERVAL1":"1",
"DELVRY_SCHEDULE0":"1",
"DELVRY_SNI0":"192.168.88.158",
"DELVRY_USE_SSL_TLS1":"0"}
?
Here is a way to do it.
In [1]: input = 'DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&DELVRY_AGGREGATION_INTERVAL1=1&DELVRY_SCHEDULE0=1&DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&'
In [2]: input.split('&')
Out[2]:
['DEFAULT_GATEWAY=192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0=1',
'DELVRY_AGGREGATION_INTERVAL1=1',
'DELVRY_SCHEDULE0=1',
'DELVRY_SNI0=192.168.88.158',
'DELVRY_USE_SSL_TLS1=0',
'']
In [3]: [keyval.split('=') for keyval in input.split('&') if keyval]
Out[3]:
[['DEFAULT_GATEWAY', '192.168.88.1'],
['DELVRY_AGGREGATION_INTERVAL0', '1'],
['DELVRY_AGGREGATION_INTERVAL1', '1'],
['DELVRY_SCHEDULE0', '1'],
['DELVRY_SNI0', '192.168.88.158'],
['DELVRY_USE_SSL_TLS1', '0']]
In [4]: dict(keyval.split('=') for keyval in input.split('&') if keyval)
Out[4]:
{'DEFAULT_GATEWAY': '192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0': '1',
'DELVRY_AGGREGATION_INTERVAL1': '1',
'DELVRY_SCHEDULE0': '1',
'DELVRY_SNI0': '192.168.88.158',
'DELVRY_USE_SSL_TLS1': '0'}
Notes
1. In [1]: This is the input line
2. In [2]: Split by & to get pairs of key-values. Note the last entry is empty
3. In [3]: Split each entry by the equal sign while throwing away empty entries
4. In [4]: Build a dictionary
Another Solution
In [8]: import urllib.parse as urlparse  # Python 3; in Python 2 this was "import urlparse"
In [9]: urlparse.parse_qsl(input)
Out[9]:
[('DEFAULT_GATEWAY', '192.168.88.1'),
('DELVRY_AGGREGATION_INTERVAL0', '1'),
('DELVRY_AGGREGATION_INTERVAL1', '1'),
('DELVRY_SCHEDULE0', '1'),
('DELVRY_SNI0', '192.168.88.158'),
('DELVRY_USE_SSL_TLS1', '0')]
In [10]: dict(urlparse.parse_qsl(input))
Out[10]:
{'DEFAULT_GATEWAY': '192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0': '1',
'DELVRY_AGGREGATION_INTERVAL1': '1',
'DELVRY_SCHEDULE0': '1',
'DELVRY_SNI0': '192.168.88.158',
'DELVRY_USE_SSL_TLS1': '0'}
Split first by '&' to get a list of strings, then by '=' (skipping the empty piece left by a trailing '&'), like this:
d = dict(kv.split('=', 1) for kv in line.split('&') if kv)
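If the input may be less regular, the same idea can be wrapped in a small function. The sketch below is only an illustration: parse_pairs is a hypothetical helper name, and it assumes that a trailing '&' should be ignored and that only the first '=' in each piece separates key from value.

```python
def parse_pairs(line):
    """Hypothetical helper: turn 'K1=V1&K2=V2&' into a dict."""
    result = {}
    for chunk in line.split('&'):
        if not chunk:                         # empty piece left by a trailing '&'
            continue
        key, _, value = chunk.partition('=')  # split on the first '=' only
        result[key] = value
    return result

line = 'DEFAULT_GATEWAY=192.168.88.1&DELVRY_SCHEDULE0=1&DELVRY_USE_SSL_TLS1=0&'
config = parse_pairs(line)
print(config)
```

A piece without any '=' ends up as a key with an empty string value, which may or may not be what you want for your data.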
import re
keys = {"DEFAULT_GATEWAY",
"DELVRY_AGGREGATION_INTERVAL0",
"DELVRY_AGGREGATION_INTERVAL1",
"DELVRY_SCHEDULE0",
"DELVRY_SNI0",
"DELVRY_USE_SSL_TLS1"}
resdict = {}
for k in keys:
    # key, '=', then everything up to the next '&'
    pat = '{}=([^&]*)&'.format(re.escape(k))
    mo = re.search(pat, bigstring)
    if mo is None:
        continue  # no match
    resdict[k] = mo.group(1)
will leave your desired result in resdict, if bigstring is the string you're searching in.
This assumes you know in advance which keys you'll be looking for, and you keep them in a set keys. If you don't know in advance the keys of interest, that's a very different issue of course.
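For the case where the keys are not known in advance, one possible sketch is to let a single pattern capture every key and value. It assumes that keys never contain '&' or '=' and that values never contain '&'; bigstring here is a shortened version of the snippet from the question.

```python
import re

bigstring = ('DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&'
             'DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&')

# Each match is a (key, value) tuple: key up to '=', value up to the next '&'
pairs = re.findall(r'([^&=]+)=([^&]*)', bigstring)
resdict = dict(pairs)
print(resdict)
```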
I'm trying to match pairs of digits in a string and capture them in groups; however, I seem to be able to capture only the last group.
Regex:
(\d\d){1,3}
Input String: 123456 789101
Match 1: 123456
Group 1: 56
Match 2: 789101
Group 1: 01
What I want is to capture all the groups like this:
Match 1: 123456
Group 1: 12
Group 2: 34
Group 3: 56
* Update
It looks like Python does not keep every capture of a repeated group (in .NET, for example, you can get all the captures in a single pass), so re.findall(r'\d\d', '123456') does the job instead.
You cannot do that with a single repeated capturing group in Python's re module, because a repeated group keeps only its last capture. Matching \d\d at every position (that is, with overlaps) would get you:
Group1: 12
Group2: 23
Group3: 34
...
Python's re module provides re.findall(), which returns non-overlapping matches and does the trick, as in:
re.findall(r'\d\d', '123456')
which will return ['12', '34', '56']
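To keep the pairs grouped per match, as in the question's desired output, one option is a two-step sketch: find each run of one to three digit pairs first, then split that run with re.findall(). The input string is the one from the question; the dictionary layout is just for illustration.

```python
import re

text = '123456 789101'

pairs_per_match = {}
# Outer pattern: the question's (\d\d){1,3}, with a non-capturing group
for m in re.finditer(r'(?:\d\d){1,3}', text):
    # Inner pass: split the matched run into its non-overlapping digit pairs
    pairs_per_match[m.group(0)] = re.findall(r'\d\d', m.group(0))
print(pairs_per_match)
```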
(\d{2})+(\d)?
I'm not sure how Python handles its matching, but this is how I would do it.
Try this:
import re
re.findall(r'\d\d','123456')
Is this what you want ? :
import re
regx = re.compile(r'(?:(?<= )|(?<=\A)|(?<=\r)|(?<=\n))'
                  r'(\d\d)(\d\d)?(\d\d)?'
                  r'(?= |\Z|\r|\n)')
for s in (' 112233 58975 6677 981 897899\r',
          '\n123456 4433 789101 41586 56 21365899 362547\n',
          '0101 456899 1 7895'):
    print(repr(s))
    print(regx.findall(s))
    print()
result
' 112233 58975 6677 981 897899\r'
[('11', '22', '33'), ('66', '77', ''), ('89', '78', '99')]
'\n123456 4433 789101 41586 56 21365899 362547\n'
[('12', '34', '56'), ('44', '33', ''), ('78', '91', '01'), ('56', '', ''), ('36', '25', '47')]
'0101 456899 1 7895'
[('01', '01', ''), ('45', '68', '99'), ('78', '95', '')]