Capture multiple substrings in one group

Capture multiple substrings in one group - python

Now, I have a folder path will contain a database table name and a ID, looks like:
path = '/something/else/TableName/000/123/456/789'
Of course I can match TableName/000/123/456/789 then split them by python script.
import re
matched = re.findall(r'.*?/(\w+(?:/\d+){4})', path)[0] # TableName/000/123/456/789
split_text = matched.split('/') # ['TableName', '000', '123', '456', '789']
table_name = split_text[0] # 'TableName'
id = int(''.join(split_text[1:])) # 123456789
.*?/(\w+(?:/\d+){4})
But I want to know, if there any function provided by regex can finish it in one step? I've tried these ways:
re.match(r'.*?/(?P<table_name>\w+)(?:/(?P<id>\d+)){4}', path).groupdict() # {'table_name': 'TableName', 'id': '789'}
re.split(r'.*?/(\w+)(?:/(\d+)){4}', path) # ['', 'TableName', '789', '']
re.sub(r'(.*?/)\w+(?:(/)\d+){4}', '', path) # '', full string has been replaced
.*?/(?P\w+)(?:/(?P\d+)){4}
.*?/(\w+)(?:/(\d+)){4}
Is there anyway else? Or I must use the python script above? I hope the result is {'table_name': 'TableName', 'id': '000123456789'} or ('TableName', '000123456789'), at least ('TableName', '000', '123', '456', '789').

Simplest way is to avoid using quantifier:
re.findall('(\w+)\/(\d+)\/(\d+)\/(\d+)\/(\d+)', path)
[('TableName', '000', '123', '456', '789')]

Easiest way would be to expand the grouping.
>>> match=re.search(r'.*?/(\w+)(?:/(\d+))(?:/(\d+))(?:/(\d+))(?:/(\d+))',a)
>>> match.groups()
('TableName', '000', '123', '456', '789')

Related

How Do I Split A String Using Multiple Delimiters (Python)

I am trying to further split an already split string to further clean it up and remove unnecessary bits of info. This is a URL split by '/'
['https:', '', 'expressjs.com', 'en', 'starter', 'hello-world.html']
I would like to be able to make it:
['https:', '', 'expressjs','com', 'en', 'starter', 'hello-world','html']
Any thoughts?

re.split can split a string on every match for your regex
>>> re.split('[/\.]', 'https://expressjs.com/en/starter/hello-world.html')
['https:', '', 'expressjs', 'com', 'en', 'starter', 'hello-world', 'html']
[/\.] matches any forward-slash or period character

Try this:
L = ['https:', '', 'expressjs.com', 'en', 'starter', 'hello-world.html']
L = [subitem for item in L for subitem in item.split('.')]
print(L)
Output:
['https:', '', 'expressjs', 'com', 'en', 'starter', 'hello-world', 'html']

Python3 regex findall

Here is my issue. Given below list:
a = ['COP' , '\t\t\t', 'Basis', 'Notl', 'dv01', '6m', '9m', '1y',
'18m', '2y', '3y', "15.6", 'mm', '4.6', '4y', '5y', '10', 'mm',
'4.6', '6y', '7y', '8y', '9y', '10y', '20y', 'TOTAL', '\t\t9.2' ]
I'm trying to get some outputs like this one. The most important note is the rows
After the first number ended on "y" or "m" will come a number only if it is there in the list
Example : ('3y', '15.6', '')
SAMPLE OUTPUT ( forget about the structure that is a tuple, jsut want teh values)
('6m', '', '')
('9m', '', '')
('1y', '', '')
('18m', '', '')
('2y', '', '')
('3y', '15.6', '')
('4y', '', '')
('5y', '10', '')
('6y', '', '')
('7y', '', '')
('8y', '', '')
('9y', '', '')
('10y', '', '')
('20y', '', '')
I used the following regex that should have returned :
all numbers followed by "y" or "m" => (\b\d+[ym]\b)
and then any number (integer or not) if it appears (meaning zero or more times)=>
(\b[0-9]+.[0-9]\b)
Here is what I did, using Python3 regex and re.findall(), but still got no result
rule2 = re.compile(r"(\b\d+[ym]\b)(\b[0-9]+.*[0-9]*\b)+")
a_str = " ".join(a)
OUT2 = re.findall(rule2, a_str)
print(OUT2)
# OUT2 >>[]
Why I'm not getting the correct result?

You cannot use word boundary twice. Since data is separated by non-letter/digits use \W+ instead.
Then, escape the dot, and make it optional, or you're not going to match 10. Don't use .* as it will match too much (regex greediness)
that yields more or less what you're looking for (note that matching strict numbers, integers or floats, is trickier than that, so this isn't perfect):
rule2 = re.compile(r"\b(\d+[ym])\W+([0-9]+\.?[0-9]*)\b")
a_str = " ".join(a)
OUT2 = re.findall(rule2, a_str)
print(OUT2)
[('3y', '15.6'), ('5y', '10')]

python regex for incomplete decimals numbers

I have a string of numbers which may have incomplete decimal reprisentation
for example
a = '1. 1,00,000.00 1 .99 1,000,000.999'
desired output
['1','1,00,000.00','1','.99','1,000,000.999']
so far i have tried the following 2
re.findall(r'[-+]?(\d+(?:[.,]\d+)*)',a)
which gives
['1', '1,00,000.00', '1', '99', '1,000,000.999']
which makes .99 to 99 which is not desired
while
re.findall(r'[-+]?(\d*(?:[.,]\d+)*)',a)
gives
['1', '', '', '1,00,000.00', '', '', '1', '', '.99', '', '1,000,000.999', '']
which gives undesirable empty string results as well
this is for finding currency values in a string so the commas separators don't have a set pattern or mat not be present at all

My suggestion is to use the regex below:
I've implemented a snippet in python.
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
result = re.split('/\.?\d\.?\,?/', a)
print result
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']

You can use re.split:
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
d = re.split('(?<=\d)\.\s+|(?<=\d)\s+', a)
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']

This regex will give you your desired output:
([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)
You can test it here: https://regex101.com/r/VfQIJC/6

Python Regex match string between specific string and end character

I am building a file stripper to build a config report, and I have a very very long string as my base data. The following is a very small snippet of it, but it at least illustrates what I'm working with.
Snippet Example: DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&DELVRY_AGGREGATION_INTERVAL1=1&DELVRY_SCHEDULE0=1&DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&
How would I go about matching the following:
between "DEFAULT_GATEWAY=" and "&"
between "DELVRY_AGGREGATION_INTERVAL0=" and "&"
between "DELVRY_AGGREGATION_INTERVAL1=" and "&"
between "DELVRY_SCHEDULE=" and "&"
between "DELVRY_SNI0=" and "&"
between "DELVRY_USE_SSL_TLS1=" and "&"
and building a dict with it like:
{"DEFAULT_GATEWAY":"192.168.88.1",
"DELVRY_AGGREGATION_INTERVAL0":"1",
"DELVRY_AGGREGATION_INTERVAL1":"1",
"DELVRY_SCHEDULE0":"1",
"DELVRY_SNI0":"0",
"DELVRY_USE_SSL_TLS1":"0"}
?

Here is a way to do it.
In [1]: input = 'DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&DELVRY_AGGREGATION_INTERVAL1=1&DELVRY_SCHEDULE0=1&DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&'
In [2]: input.split('&')
Out[2]:
['DEFAULT_GATEWAY=192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0=1',
'DELVRY_AGGREGATION_INTERVAL1=1',
'DELVRY_SCHEDULE0=1',
'DELVRY_SNI0=192.168.88.158',
'DELVRY_USE_SSL_TLS1=0',
'']
In [3]: [keyval.split('=') for keyval in input.split('&') if keyval]
Out[3]:
[['DEFAULT_GATEWAY', '192.168.88.1'],
['DELVRY_AGGREGATION_INTERVAL0', '1'],
['DELVRY_AGGREGATION_INTERVAL1', '1'],
['DELVRY_SCHEDULE0', '1'],
['DELVRY_SNI0', '192.168.88.158'],
['DELVRY_USE_SSL_TLS1', '0']]
In [4]: dict(keyval.split('=') for keyval in input.split('&') if keyval)
Out[4]:
{'DEFAULT_GATEWAY': '192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0': '1',
'DELVRY_AGGREGATION_INTERVAL1': '1',
'DELVRY_SCHEDULE0': '1',
'DELVRY_SNI0': '192.168.88.158',
'DELVRY_USE_SSL_TLS1': '0'}
Notes
This is the input line
Split by & to get pairs of key-values. Note the last entry is empty
Split each entry by the equal sign while throwing away empty entries
Build a dictionary
Another Solution
In [8]: import urlparse
In [9]: urlparse.parse_qsl(input)
Out[9]:
[('DEFAULT_GATEWAY', '192.168.88.1'),
('DELVRY_AGGREGATION_INTERVAL0', '1'),
('DELVRY_AGGREGATION_INTERVAL1', '1'),
('DELVRY_SCHEDULE0', '1'),
('DELVRY_SNI0', '192.168.88.158'),
('DELVRY_USE_SSL_TLS1', '0')]
In [10]: dict(urlparse.parse_qsl(input))
Out[10]:
{'DEFAULT_GATEWAY': '192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0': '1',
'DELVRY_AGGREGATION_INTERVAL1': '1',
'DELVRY_SCHEDULE0': '1',
'DELVRY_SNI0': '192.168.88.158',
'DELVRY_USE_SSL_TLS1': '0'}

Split first by '&' to get a list of strings, then by '=', like this:
d = dict(kv.split('=') for kv in line.split('&'))

import re
keys = {"DEFAULT_GATEWAY",
"DELVRY_AGGREGATION_INTERVAL0",
"DELVRY_AGGREGATION_INTERVAL1",
"DELVRY_SCHEDULE0",
"DELVRY_SNI0",
"DELVRY_USE_SSL_TLS1"}
resdict = {}
for k in keys:
pat = '{}([^&])&'.format(k)
mo = re.search(pat, bigstring)
if mo is None: continue # no match
resdict[k] = mo.group(1)
will leave your desired result in resdict, if bigstring is the string you're searching in.
This assumes you know in advance which keys you'll be looking for, and you keep them in a set keys. If you don't know in advance the keys of interest, that's a very different issue of course.

turn key="value" string into a dict

I have a string with the following format:
author="PersonsName" date="1183050420" format="1.1" version="1.2"
I want to turn it in to a Python dict, a la:
{'author': 'PersonsName', 'date': '1183050420', 'format': '1.1', 'version': '1.2'}
I have tried to do so using re.split on the string as so:
attribs = (re.split('(=?" ?)', twikiattribs))
thinking I would get a list back like:
['author', 'PersonsName', 'date', '1183050420', 'format', '1.1', 'version', '1.2']
that then I could turn into a dict, but instead I'm getting:
['author', '="', 'PersonsName', '" ', 'date', '="', '1183050420', '" ', 'format', '="', '1.1', '" ', 'version', '="', '1.2', '"', '']
So, before I follow the re.split line further, is there generally a better way to achieve what I'm trying to do, and/or if the solution involves re.split, how can I write a regex that will split on any of the strings =", "_ (where "_" is a space char) or just " to just yield a list with the keys in the odd indices, and values in the even?

Use re.findall():
dict(re.findall(r'(\w+)="([^"]+)"', twikiattribs))
re.findall(), when presented with a pattern with multiple capturing groups, returns a list of tuples, each nested tuple containing the captured groups. dict() happily takes that output and interprets each nested tuple as a key-value pair.
Demo:
>>> import re
>>> twikiattribs = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
>>> re.findall(r'(\w+)="([^"]+)"', twikiattribs)
[('author', 'PersonsName'), ('date', '1183050420'), ('format', '1.1'), ('version', '1.2')]
>>> dict(re.findall(r'(\w+)="([^"]+)"', twikiattribs))
{'date': '1183050420', 'format': '1.1', 'version': '1.2', 'author': 'PersonsName'}
re.split() also behaves differently based on capturing groups; the text on which you split is included in the output if grouped. Compare the output with and without the capturing group:
>>> re.split('(=?" ?)', twikiattribs)
['author', '="', 'PersonsName', '" ', 'date', '="', '1183050420', '" ', 'format', '="', '1.1', '" ', 'version', '="', '1.2', '"', '']
>>> re.split('=?" ?', twikiattribs)
['author', 'PersonsName', 'date', '1183050420', 'format', '1.1', 'version', '1.2', '']
The re.findall() output is far easier to convert to a dictionary however.

you can also do it without re in one line:
>>> data = '''author="PersonsName" date="1183050420" format="1.1" version="1.2"'''
>>> {k:v.strip('"') for k,v in [i.split("=",1) for i in data.split(" ")]}
{'date': '1183050420', 'format': '1.1', 'version': '1.2', 'author': 'PersonsName'}
if whitespaces are allowed inside the values you can use this line:
>>> {k:v.strip('"') for k,v in [i.split("=",1) for i in data.split('" ')]}

The way I'd personally parse it:
import shlex
s = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
dict(x.split('=') for x in shlex.split(s))
Out[12]:
{'author': 'PersonsName',
'date': '1183050420',
'format': '1.1',
'version': '1.2'}

A non-regex list comprehension one liner:
>>> s = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
>>> print dict([tuple(x.split('=')) for x in s.split()])
{'date': '"1183050420"', 'format': '"1.1"', 'version': '"1.2"', 'author': '"PersonsName"'}

The problem is that you included parenthesis in your regex, which turns it into a captured group and includes it in the split. Assign attribs like this
attribs = (re.split('=?" ?', twikiattribs))
and it will work as expected. This does return a blank string (due to the final " in your input string), so you'll want to use attribs[:-1] when creating the dictionary.

Try
>>> str = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
>>> eval ('dict(' + str.replace(" ",",") + ')')
{'date': '1183050420', 'format': '1.1', 'version': '1.2', 'author': 'PersonsName'}
assuming as earlier the values have no space in them.
Beware of using eval() though. Bad things may happen for funny input. Don't use it on user input.

This might help some other people that re.findall() doesn't.
# grabbing input
input1 = dict,list,ect
# creating a phantom variable
Phantom = 'variable_name = ' + input1
# executing the phantom
phenomenon = exec(Phantom)
# storing the phantom variable in a live one
output = variable_name
# printing the stored phantom variable
print(output)
What it essentially does is adds a variable name to your input and creates that variable.
For example, if your list returns as "[[1,2][list][3,4]]" this executes as variable_name = [[1,2][list][3,4]]
In which activates it's original function.
It does create a PEP 8 error since the variable doesn't exist until it runs.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Capture multiple substrings in one group - python

Simplest way is to avoid using quantifier: re.findall('(\w+)\/(\d+)\/(\d+)\/(\d+)\/(\d+)', path) [('TableName', '000', '123', '456', '789')]

Easiest way would be to expand the grouping. >>> match=re.search(r'.*?/(\w+)(?:/(\d+))(?:/(\d+))(?:/(\d+))(?:/(\d+))',a) >>> match.groups() ('TableName', '000', '123', '456', '789')

Related

How Do I Split A String Using Multiple Delimiters (Python)

Python3 regex findall

python regex for incomplete decimals numbers

Python Regex match string between specific string and end character

turn key="value" string into a dict

Categories

Resources