Split string using capture groups - python

I have two strings:
/some/path/to/sequence2.1001.tif
and
/some/path/to/sequence_another_u1_v2.tif
I want to write a function that splits each string into a list of parts according to some regex, such that the parts can be joined back together without losing any characters. For example:
def split_by_group(path, re_compile):
    # ...
    return ['the', 'parts', 'here']

split_by_group('/some/path/to/sequence2.1001.tif', re.compile(r'\.(\d+)\.'))
# Result: ['/some/path/to/sequence2.', '1001', '.tif']
split_by_group('/some/path/to/sequence_another_u1_v2.tif', re.compile(r'_[uv](\d+)'))
# Result: ['/some/path/to/sequence_another_u', '1', '_v', '2', '.tif']
It's less important that the regex be exactly what I wrote above (but ideally, I'd like the accepted answer to use both). My only criteria are that the split string must be recombinable without losing any digits and that each string is split the way I showed above (where the split occurs right at the start/end of the capture group, not of the full match).
I made something with finditer but it's horribly hacky and I'm looking for a cleaner way. Can anyone help me out?

I changed your regex a little bit, if you don't mind. I'm not sure whether this works for your other cases.
def split_by_group(path, re_compile):
    # Drop the empty strings that re.split produces between adjacent matches
    l = [s for s in re_compile.split(path) if s]
    # Glue the first separator back onto the leading part
    l[0:2] = [''.join(l[0:2])]
    return l

split_by_group('/some/path/to/sequence2.1001.tif', re.compile(r'(\.)(\d+)'))
# Result: ['/some/path/to/sequence2.', '1001', '.tif']
split_by_group('/some/path/to/sequence_another_u1_v2.tif', re.compile(r'(_[uv])(\d+)'))
# Result: ['/some/path/to/sequence_another_u', '1', '_v', '2', '.tif']
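If you would rather keep the patterns from the question unchanged, one possible approach is to walk the matches with finditer and cut at the span of each capture group. This is only a sketch: it assumes the capture groups are not nested and appear left to right in the pattern, and the parameter name pattern is my own choice.

```python
import re

def split_by_group(path, pattern):
    """Cut path at the start and end of every capture group matched by pattern."""
    parts = []
    pos = 0  # end of the last piece emitted so far
    for match in pattern.finditer(path):
        for group in range(1, pattern.groups + 1):
            start, end = match.span(group)
            if start == -1:
                # This group did not take part in the match
                continue
            parts.append(path[pos:start])   # text before the group
            parts.append(path[start:end])   # the group itself
            pos = end
    parts.append(path[pos:])                # whatever is left over
    return parts

print(split_by_group('/some/path/to/sequence2.1001.tif', re.compile(r'\.(\d+)\.')))
print(split_by_group('/some/path/to/sequence_another_u1_v2.tif', re.compile(r'_[uv](\d+)')))
```

Because every character ends up in exactly one piece, ''.join(...) on the result gives back the original path.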

Related

separate number from the string

I want to separate the numbers from the string, but when there are successive '1's, separate them as well.
I think there must be a smart way to solve this.
s = 'NNNN1234N11N1N123'
expected result is:
['1234','1','1','1','123']
I think what you want can be solved by using the re module
>>> import re
>>> re.findall('(?:1[2-90]+)|1', 'NNNN1234N11N1N123')
Outputs
['1234', '1', '1', '1', '123']
EDIT: As suggested in the comments by @CrafterKolyan, the regular expression can be reduced to 1[2-90]*.
I would also use regular expressions (the re module), but a different function, namely re.split, in the following way:
import re
s = 'NNNN1234N11N1N123'
output = re.split(r'[^\d]+|(?<=1)(?=1)',s)
print(output) # ['', '1234', '1', '1', '1', '123']
output = [i for i in output if i] # jettison empty strs
print(output) # ['1234', '1', '1', '1', '123']
Explanation: You want to split a str to get a list of strs, which is what re.split is for. The first argument of re.split tells it where the splits should happen; everything that matches is removed unless capturing groups are used (similar to the str method split). I need to specify two kinds of places to cut, so I used | (alternation) to tell re.split to cut at:
[^\d]+ that is, 1 or more non-digits
(?<=1)(?=1) that is, an empty str preceded by 1 and followed by 1; here I used a feature called zero-length assertions (twice). Splitting on an empty match like this requires Python 3.7 or newer.
Note that re.split produced '' (an empty str) before the desired output. This means that the first cut (NNNN in this case) started at the beginning of the str. This is expected behavior of re.split, but we do not need that information here, so we can jettison any empty strs, for which I used a list comprehension.
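For comparison, here is a small sketch that wraps both approaches in functions (the function names are made up for illustration). It assumes Python 3.7+ for the zero-width split. The two agree on the sample string, but they are not equivalent in general: the findall version only finds numbers that start with 1, while the split version keeps every run of digits.

```python
import re

def separate_with_findall(s):
    # A '1' followed by any run of digits other than '1'
    return re.findall(r'1[02-9]*', s)

def separate_with_split(s):
    # Cut at runs of non-digits and at the empty position between two '1's
    parts = re.split(r'\D+|(?<=1)(?=1)', s)
    # Discard the empty strings left over at the edges
    return [p for p in parts if p]

s = 'NNNN1234N11N1N123'
print(separate_with_findall(s))
print(separate_with_split(s))
```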

Cut within a pattern using Python regex

Objective: I am trying to perform a cut in Python RegEx where split doesn't quite do what I want. I need to cut within a pattern, but between characters.
What I am looking for:
I need to recognize the pattern below in a string, and split the string at the location of the pipe. The pipe isn't actually in the string, it just shows where I want to split.
Pattern: CDE|FG
String: ABCDEFGHIJKLMNOCDEFGZYPE
Results: ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
What I have tried:
It seems like using split with parentheses is close, but it doesn't keep the search pattern attached to the results like I need it to.
re.split('CDE()FG', 'ABCDEFGHIJKLMNOCDEFGZYPE')
Gives,
['AB', '', 'HIJKLMNO', '', 'ZYPE']
When I actually need,
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
Motivation:
Practicing with RegEx, and wanted to see if I could use RegEx to make a script that would predict the fragments of a protein digestion using specific proteases.
A non regex way would be to replace the pattern with the piped value and then split.
>>> pattern = 'CDE|FG'
>>> s = 'ABCDEFGHIJKLMNOCDEFGZYPE'
>>> s.replace('CDEFG',pattern).split('|')
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
You can solve it with re.split() and positive "look arounds":
>>> re.split(r"(?<=CDE)(\w+)(?=FG)", s)
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
Note that if one of the cut sequences is an empty string, you get an empty string inside the resulting list. You can handle that "manually", for example (I admit, it is not that pretty):
import re

s = "ABCDEFGHIJKLMNOCDEFGZYPE"
cut_sequences = [
    ["CDE", "FG"],
    ["FGHI", ""],
    ["", "FGHI"]
]
for left, right in cut_sequences:
    items = re.split(r"(?<={left})(\w+)(?={right})".format(left=left, right=right), s)
    # An empty cut sequence leaves an empty string at that end of the list
    if not left:
        items = items[1:]
    if not right:
        items = items[:-1]
    print(items)
Prints:
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
['ABCDEFGHI', 'JKLMNOCDEFGZYPE']
['ABCDE', 'FGHIJKLMNOCDEFGZYPE']
To keep the splitting pattern when you split with re.split, or parts of it, enclose them in parentheses.
>>> data
'ABCDEFGHIJKLMNOCDEFGZYPE'
>>> pieces = re.split(r"(CDE)(FG)", data)
>>> pieces
['AB', 'CDE', 'FG', 'HIJKLMNO', 'CDE', 'FG', 'ZYPE']
Easy enough. All the parts are there, but as you can see they have been separated. So we need to reassemble them. That's the trickier part. Look carefully and you'll see you need to join the first two pieces, the last two pieces, and the rest in triples. I simplify the code by padding the list, but you could do it with the original list (and a bit of extra code) if performance is a problem.
>>> pieces = [""] + pieces
>>> [ "".join(pieces[i:i+3]) for i in range(0,len(pieces), 3) ]
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
re.split() guarantees a piece for every capturing (parenthesized) group, plus a piece for what's between. With more complex regular expressions that need their own grouping, use non-capturing groups to keep the format of the returned data the same. (Otherwise you'll need to adapt the reassembly step.)
PS. I also like Bhargav Rao's suggestion to insert a separator character in the string. If performance is not an issue, I guess it's a matter of taste.
Edit: Here's a (less transparent) way to do it without adding an empty string to the list:
pieces = re.split(r"(CDE)(FG)", data)
result = [ "".join(pieces[max(i-3,0):i]) for i in range(2,len(pieces)+2, 3) ]
A safer non-regex solution could be this:
def split(string, pattern):
    """Split the given string at the place indicated by a pipe (|) in the pattern."""
    # Use a marker that is unlikely to occur in the data
    safe_splitter = "####SPLIT_HERE####"
    safe_pattern = pattern.replace("|", safe_splitter)
    string = string.replace(pattern.replace("|", ""), safe_pattern)
    return string.split(safe_splitter)

s = "ABCDEFGHIJKLMNOCDEFGZYPE"
print(split(s, "CDE|FG"))
print(split(s, "|FG"))
print(split(s, "FGH|"))
https://repl.it/C448
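On Python 3.7 or newer, re.split also accepts a pattern that matches the empty string, so the cut point itself can be described with a lookbehind and a lookahead and nothing has to be reassembled afterwards. A minimal sketch, assuming the left and right cut sequences are literal text (hence re.escape); the function name cut is hypothetical:

```python
import re

def cut(string, left, right):
    """Split string at every position preceded by left and followed by right."""
    if not left and not right:
        raise ValueError("at least one of left/right must be non-empty")
    pattern = ""
    if left:
        pattern += "(?<={})".format(re.escape(left))   # text just before the cut
    if right:
        pattern += "(?={})".format(re.escape(right))   # text just after the cut
    return re.split(pattern, string)

s = "ABCDEFGHIJKLMNOCDEFGZYPE"
print(cut(s, "CDE", "FG"))
print(cut(s, "", "FG"))
print(cut(s, "FGH", ""))
```

Unlike the (\w+) trick, this does not depend on how many cut sites the string contains.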

Separate the string in Python, excluding some elements which contain separator

I have a really ugly string like this:
# ugly string follows:
ugly_string1 = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
# which also may look like this (part within quotes is different):
ugly_string2 = 'SVEF/XX:1/60/24.02.16 07:30:00/"kWh"/0/ENDTIME'
and I'd like to separate it to get this list in Python:
['SVEF/XX:1', '60', '24.02.16 07:30:00', '"isk/kWh"', '0', 'ENDTIME']
# or from the second string:
['SVEF/XX:1', '60', '24.02.16 07:30:00', '"kWh"', '0', 'ENDTIME']
The first element (SVEF/XX:1) will always be the same, but the fourth element might or might not have the separator character in it (/).
I came up with a regex which isolates the 1st and the 4th element:
(?=(SVEF/XX:1))|(?=("(.*?)"))
but I just cannot figure out how to separate the rest of the string by / character, while excluding those two isolated elements?
I can do it with a more "manual" approach, with a regex like this:
([^/]+/[^/]+)/([^/]+)/([^/]+)/("[^"]+")/([^/]+)/([^/]+)
but when I try this out in Python, I get extra empty elements for some reason:
['', 'SVEF/XX:1', '60', '24.02.16 07:30:00', '"isk/kWh"', '0', 'ENDTIME', '']
I could sanitize this result afterwards, but it would be great if I separate those strings without extra interventions.
In Python, this can be done more easily (and with more room to generalize or adapt the approach in the future) with successive uses of split() and rsplit().
ugly_string = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
temp = ugly_string.split("/", maxsplit=4)
result = [ temp[0]+"/"+temp[1] ] + temp[2:-1] + temp[-1].rsplit("/", maxsplit=2)
print(result)
Prints:
['SVEF/XX:1', '60', '24.02.16 07:30:00', '"isk/kWh"', '0', 'ENDTIME']
I use the second argument of split/rsplit to limit how many slashes are split. I first split as many parts off the left as possible (i.e., 4), and rejoin parts 0 and 1 (the SVEF and XX). I then use rsplit() to make the rest of the split from the right. What's left in the middle is the quoted word, regardless of what it contains.
Rejoining the first two parts isn't too elegant, but neither is a format that allows / to appear both as a field separator and inside an unquoted field.
You can use re.findall testing first the quoted parts and making the beginning optional in the second branch:
re.findall(r'(?:^|/)("[^"]*"|(?:^[^/]*/)?[^/"]*)', s)
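As a quick check, the pattern above can be wrapped in a helper and run against both sample strings from the question. The names FIELD and split_fields are just placeholders for this sketch:

```python
import re

# Quoted field first; otherwise an unquoted field, which at the very start
# of the string is allowed to swallow one extra "xxx/" prefix.
FIELD = re.compile(r'(?:^|/)("[^"]*"|(?:^[^/]*/)?[^/"]*)')

def split_fields(s):
    # With a single capture group, findall returns the group contents
    return FIELD.findall(s)

print(split_fields('SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'))
print(split_fields('SVEF/XX:1/60/24.02.16 07:30:00/"kWh"/0/ENDTIME'))
```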
Python's csv module can handle a delimiter that also appears inside a quoted field, if you're OK with reinserting the " around the field where it seems to always exist, and reassembling the first field.
If you have a string, and want to treat it as a csv file, you can do this to prepare:
>>> import io
>>> import csv
>>> ugly_string1 = 'SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'
>>> f = io.StringIO(ugly_string1)
Otherwise, assuming f is an open file, or the object we just created above:
>>> reader = csv.reader(f, delimiter='/')
>>> for row in reader:
...     print(row)
...
['SVEF', 'XX:1', '60', '24.02.16 07:30:00', 'isk/kWh', '0', 'ENDTIME']
>>> first = "/".join(row[0:2])
Thank you all for your answers, they are all good and very helpful! However, after testing the performance of each one I came up with surprising results. Essentially, the timeit module ended up every time with results similar to this:
============================================================
example from my question:
0.21345195919275284
============================================================
Tushar's comment on my question:
0.21896087005734444
============================================================
alexis' answer (although not completely correct answer):
0.2645496800541878
============================================================
Casimir et Hippolyte's answer:
0.3663317859172821
============================================================
Simon Fraser's csv answer:
1.398559506982565
So, I decided to stick with my own example:
([^/]+/[^/]+)/([^/]+)/([^/]+)/("[^"]+")/([^/]+)/([^/]+)
but I'll reward your efforts nevertheless!
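The extra empty elements mentioned in the question are what re.split would return when the pattern consumes the whole string, so I am assuming that is how the regex was applied. A sketch that uses the same pattern with fullmatch and reads the groups directly, which avoids the clean-up step (the names FIELDS and parse_line are made up here):

```python
import re

FIELDS = re.compile(r'([^/]+/[^/]+)/([^/]+)/([^/]+)/("[^"]+")/([^/]+)/([^/]+)')

def parse_line(line):
    """Return the six fields of line, or raise ValueError if it does not fit."""
    match = FIELDS.fullmatch(line)
    if match is None:
        raise ValueError("unexpected format: {!r}".format(line))
    # groups() holds exactly the captured fields, with no empty padding
    return list(match.groups())

print(parse_line('SVEF/XX:1/60/24.02.16 07:30:00/"isk/kWh"/0/ENDTIME'))
print(parse_line('SVEF/XX:1/60/24.02.16 07:30:00/"kWh"/0/ENDTIME'))
```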

Better way to parse from regex?

I am doing the following to get the movieID:
>>> x.split('content')
['movieID" ', '="770672122">']
>>> [item for item in x.split('content')[1] if item.isdigit()]
['7', '7', '0', '6', '7', '2', '1', '2', '2']
>>> ''.join([item for item in x.split('content')[1] if item.isdigit()])
'770672122'
What would be a better way to do this?
Without using a regular expression, you could just split by the double quotes and take the next to last field.
u="""movieID" content="7706">"""
u.split('"')[-2] # returns: '7706'
This trick is definitely the most readable, if you don't know about regular expressions yet.
Your string is a bit strange though as there are 3 double quotes. I assume it comes from an HTML file and you're only showing a small substring. In that case, you might make your code more robust by using a regular expression such as:
import re
s = re.search(r'(\d+)', u) # looks for one or more consecutive digits
s.groups() # returns: ('7706',)
You could make it even more robust (but you'll need to read more) by using a DOM-parser such as BeautifulSoup.
I assume x looks like this:
x = 'movieID content="770672122">'
Regex is definitely one way to extract the content. For example:
>>> re.search(r'content="(\d+)', x).group(1)
'770672122'
The above fetches one or more consecutive digits which follow the string content=".
If your string looks like the one below, you could do something like the following:
>>> import re
>>> x = 'movieID content="770672122">'
>>> re.search(r'\d+', x).group()
'770672122'
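Note that re.search returns None when nothing matches, so calling .group() directly raises AttributeError on unexpected input. A small sketch that guards against this, assuming the ID is always a run of digits inside content="..." (the function name is hypothetical):

```python
import re

def extract_content_id(text):
    """Return the digits inside content="...", or None if they are not found."""
    match = re.search(r'content="(\d+)"', text)
    return match.group(1) if match else None

x = 'movieID content="770672122">'
print(extract_content_id(x))
print(extract_content_id('no id in here'))
```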

Grabbing multiple patterns in a string using regex

In python I'm trying to grab multiple inputs from string using regular expression; however, I'm having trouble. For the string:
inputs = 12 1 345 543 2
I tried using:
match = re.match(r'\s*inputs\s*=(\s*\d+)+',string)
However, this only returns the value '2'. I'm trying to capture all the values '12','1','345','543','2' but not sure how to do this.
Any help is greatly appreciated!
EDIT: Thank you all for explaining why this does not work and providing alternative suggestions. Sorry if this is a repeat question.
You could try something like:
re.findall(r"\d+", your_string)
You cannot do this with a single regex (unless you were using .NET), because each capturing group will only ever return one result even if it is repeated (the last one in the case of Python).
Since variable length lookbehinds are also not possible (in which case you could do (?<=inputs.*=.*)\d+), you will have to separate this into two steps:
match = re.match(r'\s*inputs\s*=\s*(\d+(?:\s*\d+)+)', string)
integers = re.split(r'\s+',match.group(1))
So now you capture the entire list of integers (and the spaces between them), and then you split that capture at the spaces.
The second step could also be done using findall:
integers = re.findall(r'\d+',match.group(1))
The results are identical.
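To see the difference concretely, here is a short sketch that runs the pattern from the question next to the two-step approach. The inner pattern (\d+(?:\s+\d+)*) is a slight variation of mine that also accepts a single number:

```python
import re

string = 'inputs = 12 1 345 543 2'

# A repeated capturing group only remembers its final repetition
repeated = re.match(r'\s*inputs\s*=(\s*\d+)+', string)
last_only = repeated.group(1)

# Step 1: capture the whole run of numbers; step 2: pick out each number
whole = re.match(r'\s*inputs\s*=\s*(\d+(?:\s+\d+)*)', string)
integers = re.findall(r'\d+', whole.group(1))

print(repr(last_only))
print(integers)
```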
You can embed your regular expression:
import re
s = 'inputs = 12 1 345 543 2'
print(re.findall(r'(\d+)', re.match(r'inputs\s*=\s*([\s\d]+)', s).group(1)))
>>>
['12', '1', '345', '543', '2']
Or do it in layers:
import re
def get_inputs(s, regex=r'inputs\s*=\s*([\s\d]+)'):
    match = re.match(regex, s)
    if not match:
        return False # or raise an exception - whatever you want
    else:
        return re.findall(r'(\d+)', match.group(1))

s = 'inputs = 12 1 345 543 2'
print(get_inputs(s))
>>>
['12', '1', '345', '543', '2']
You should look at this answer: https://stackoverflow.com/a/4651893/1129561
In short:
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
