Applying a Regex to a Substring Without using String Slice - python

I want to search for a regex match in a larger string from a certain position onwards, and without using string slices.
My background is that I want to search through a string iteratively for matches of various regexes. A natural solution in Python would be to keep track of the current position within the string and use e.g.
re.match(regex, largeString[pos:])
in a loop. But for really large strings (~ 1MB) string slicing as in largeString[pos:] becomes expensive. I'm looking for a way to get around that.
Side note: Funnily, in a niche of the Python documentation, it talks about an optional pos parameter to the match function (which would be exactly what I want), which is not to be found with the functions themselves :-).

The variants with pos and endpos parameters only exist as members of regular expression objects. Try this:
import re
pattern = re.compile("match here")
text = "don't match here, but do match here"
start = text.find(",")
print(pattern.search(text, start).span())
... outputs (25, 35)

The pos keyword is only available in the method versions. For example,
re.match("e+", "eee3", pos=1)
is invalid, but
pattern = re.compile("e+")
pattern.match("eee3", pos=1)
works.

>>> import re
>>> m=re.compile ("(o+)")
>>> m.match("oooo").span()
(0, 4)
>>> m.match("oooo",2).span()
(2, 4)
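Putting the answers above together for the iterative use case in the question: a minimal sketch of a scanning loop that advances pos instead of slicing. The token patterns and the names TOKEN_PATTERNS and tokenize are made up for illustration; only the compiled-pattern match(string, pos) call comes from the answers.

```python
import re

# Hypothetical token patterns, compiled once so that match() accepts pos.
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"\d+")),
    ("WORD", re.compile(r"[A-Za-z]+")),
    ("SPACE", re.compile(r"\s+")),
]


def tokenize(text):
    """Yield (kind, lexeme) pairs, scanning text without slicing it."""
    pos = 0
    while pos < len(text):
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)  # match must start exactly at pos
            if m:
                yield kind, m.group(0)
                pos = m.end()  # continue right after this token
                break
        else:
            raise ValueError("no pattern matches at position %d" % pos)


tokens = list(tokenize("abc 123 de"))
print(tokens)
```

One difference from slicing to keep in mind: with pos, a ^ in the pattern still refers to the real start of the string, not to pos.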

You could also use positive lookbehinds, like so:
import re
test_string = "abcabdabe"
position = 3
a = re.search("(?<=.{" + str(position) + "})ab[a-z]", test_string)
print(a.group(0))
yields:
abd

Related

How do you find all instances of a substring, followed by a certain number of dynamic characters?

I'm trying to find all instances of a specific substring(a!b2 as an example) and return them with the 4 characters that follow after the substring match. These 4 following characters are always dynamic and can be any letter/digit/symbol.
I've tried searching, but it seems like the similar questions that are asked are requesting help with certain characters that can easily split a substring, but since the characters I'm looking for are dynamic, I'm not sure how to write the regex.
When using regex, you can use "." to dynamically match any character. Use {number} to specify how many characters to match, and use parentheses as in (.{number}) to specify that the match should be captured for later use.
>>> import re
>>> s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
>>> print(re.findall("a!b2(.{4})", s))
['foob', 'bazq', 'spam']
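import re
s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
print(re.search(r'a!b2(.{4})', s).group(1))
.{4} matches any 4 characters except newlines (pass re.DOTALL to match those too).
group(0) is the complete match; group(1) is the text captured by the first parenthesized group. You can read about group id here.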
If you're only looking for how to grab the following 4 characters using regex, what you probably want is the curly-brace quantifier: '{}'.
They go into more detail in the post here, but essentially you would use [a-zA-Z0-9]{X,Y} or (.{X,Y}), where X and Y are the minimum and maximum number of characters to match (in your case, you would only need {4}).
That said, a more Pythonic way to solve this problem would be to use string slicing and the index method.
E.g., given an input_string, when you find the substring at index i using index, you can use input_string[i+len(sub_str):i+len(sub_str)+4] to grab those following characters.
As an example,
input_string = 'abcdefg'
sub_str = 'abcd'
found_index = input_string.index(sub_str)
start_index = found_index + len(sub_str)
symbol = input_string[start_index: start_index + 4]
Outputs (to show it works with <4 as well): efg
Index also allows you to give start and end indexes for the search, so you could also use it in a loop if you wanted to find it for every sub string, with the start of the search index being the previous found index + 1.
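A minimal sketch of that loop, using str.find (which returns -1 instead of raising when nothing is found) rather than index. The function name following_chars and its count parameter are hypothetical, chosen only for this example.

```python
def following_chars(text, sub, count=4):
    """Return up to `count` characters following each occurrence of sub."""
    results = []
    start = 0
    while True:
        found = text.find(sub, start)  # -1 when there are no more matches
        if found == -1:
            break
        tail_start = found + len(sub)
        results.append(text[tail_start:tail_start + count])
        start = found + 1  # resume one past the start of the previous hit
    return results


s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
print(following_chars(s, "a!b2"))
```

Because the search resumes at found + 1, overlapping occurrences of the substring are also reported.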

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But it gives the wrong output for the first string. I also tried another one (here):
(.)(.*?)*?\1
But this one gives the wrong output as well. What's going wrong here?
The Python code is a print statement:
print(re.sub(r'(.)(.*?)\1', r'\g<2>', s)) # s is the string
It can be solved without a regular expression, like below (here s1 is abca and s is abbcabb):
>>> ''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>> ''.join([i for i in s if s.count(i) == 1])
'c'
re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)
Regular expressions don't seem to be the ideal solution:
they don't handle overlapping matches, so a loop is needed (like in this answer), and that loop creates new strings over and over (performance suffers)
they're overkill here; we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all elements each time.
It can be solved without regular expression and without O(n**2) complexity, only O(n) using collections.Counter
first count the characters of the string very easily & quickly
then filter the string testing if the count matches using the counter we just created.
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)
EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first pattern not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack until it finds the longest match for subpattern A that still allows the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.
The site explains it well, hover and use the explanation section.
(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by anything in the middle, until that same character is encountered again.
So, for abbcabb, the "sandwiched" portion is bbc, between the two a's.
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
    if i not in result:
        result.append(i)
    else:
        result.remove(i)
print(''.join(result))
Note that this keeps the "last" odd occurrence of each character and not the first.
For the "first" known occurrence, you should use a counter as suggested in this answer. Just change the condition to check for odd counts. Pseudocode: count[letter] % 2 == 1
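One possible reading of that pseudocode, as a sketch: count every character once with collections.Counter, then keep each character with an odd count at the position where it first appears. The function name keep_odd is hypothetical, and "emit each such character once" is my interpretation of "first occurrence", not something the answer spells out.

```python
from collections import Counter


def keep_odd(s):
    """Keep characters that occur an odd number of times, once each,
    in order of first appearance."""
    counts = Counter(s)  # single O(n) counting pass
    seen = set()
    out = []
    for ch in s:
        if counts[ch] % 2 == 1 and ch not in seen:
            seen.add(ch)
            out.append(ch)
    return "".join(out)


print(keep_odd("abbcabb"))
print(keep_odd("abca"))
```

For the two strings in the question this gives the same results as the list-based loop. The difference shows on an input like abaca, where the list-based loop yields bca while this version yields abc.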

Why is the split() returning list objects that are empty? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
time_info = list(filter(None, str_list))  # list() is needed on Python 3, where filter returns an iterator
Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
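A short sketch of the named-group variant. The group names start and end are placeholders picked for this example (the answer itself uses groupA and groupB).

```python
import re

f = "000014_L_20111007T084734-20111008T023142.txt"

# Named groups; "start" and "end" are illustrative names only.
pattern = re.compile(r"[LU]_(?P<start>\w+)-(?P<end>\w+)\.")

m = pattern.search(f)
time_info = m.groupdict() if m else {}  # empty dict when the name does not match
print(time_info)
```

Checking m before calling groupdict() avoids an AttributeError on file names that do not follow the pattern.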
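If the timestamps are always after the second _ then you can use str.split and str.rsplit (str.strip(".txt") would strip any of the characters ., t and x from both ends rather than removing the suffix):
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.rsplit(".", 1)[0].split("_", 2)[-1].split("-")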
['20111007T084734', '20111008T023142']
Since this came up on google and for completeness, try using re.findall as an alternative!
This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don't want that.
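For the file names in the question, a re.findall version might look like the sketch below. The timestamp shape (8 digits, a T, 6 digits) is an assumption read off the two sample names, not something the question states.

```python
import re

f = "000014_L_20111007T084734-20111008T023142.txt"

# Assumed timestamp shape: YYYYMMDD, a literal T, then HHMMSS.
time_info = re.findall(r"\d{8}T\d{6}", f)
print(time_info)
```

Since findall describes what to keep instead of what to split on, there are no leading or trailing empty strings to filter out.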
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

regular expression that reference a match from earlier part of expression

I'm looking for a regular expression that will identify a sequence in which an integer in the text specifies the number of trailing letters at the end of the expression. This specific example applies to identifying insertions and deletions in genetic data in the pileup format.
For example:
If the text I am searching is:
AtT+3ACGTTT-1AaTTa
I need to match the insertions and deletions, which in this case are +3ACG and -1A. The integer (n) portion can be any integer greater than or equal to 1, and I must capture the n trailing characters.
I can match a single insertion or deletion with [+-]?[0-9]+[ACGTNacgtn], but I can't figure out how to grab the exact number of trailing ACGTN's specified by the integer.
I apologize if there is an obvious answer here, I have been searching for hours. Thanks!
(UPDATE)
I typically work in Python. The one workaround I've been able to figure out with the re module in Python is to collect both the integer and the span of every in/del and combine the two to extract the appropriate length of text.
For example:
>>> import re
>>> a = 'ATTAA$At^&atAA-1A+1G+4ATCG'
>>> expr = '[+-]?([0-9]+)[ACGTNacgtn]'
>>> ints = re.findall(expr, a) #returns a list of the integers
>>> spans = [i.span() for i in re.finditer(expr,a)]
>>> newspans = [(spans[i][0], spans[i][1] + (int(ints[i]) - 1)) for i in range(len(spans))]
>>> newspans
[(14, 17), (17, 20), (20, 26)]
The resulting tuples allow me to slice out the indels. Probably not the best syntax, but it works!
You can use regular expression substitution passing a function as replacement... for example
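import re
s = "abcde+3fghijkl-1mnopqr+12abcdefghijklmnoprstuvwxyz"
def dump(match):
    start, end = match.span()
    # the digits after the sign say how many trailing characters to include
    print(s[start:end + int(s[start+1:end])])
    return match.group(0)  # leave the text itself unchanged
re.sub(r'[-+]\d+', dump, s)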
#output
# +3fgh
# -1m
# +12abcdefghijkl
It's not directly possible, regexes can't 'count' like that.
But if you're using a programming language that allows callbacks as a regex match evaluator (e.g. C#, PHP), then what you could do is have the regex as [+-]?([0-9]+)([ACGTNacgtn]+) and in the callback trim the trailing characters to the desired length.
e.g. for C#
var regexMatches = new List<string>();
Regex theRegex = new Regex(@"[+-]?([0-9]+)([ACGTNacgtn]+)");
text = theRegex.Replace(text, delegate(Match thisMatch)
{
    int numberOfInsertsOrDeletes = Convert.ToInt32(thisMatch.Groups[1].Value);
    string trailingString = thisMatch.Groups[2].Value;
    // keep only the first n trailing characters
    if (numberOfInsertsOrDeletes < trailingString.Length)
    { trailingString = trailingString.Substring(0, numberOfInsertsOrDeletes); }
    regexMatches.Add(trailingString);
    return thisMatch.Groups[0].Value;
});
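Since the question was later tagged Python, here is a rough Python counterpart of the same two-step idea: let the regex find the sign and the integer, then take that many characters in ordinary code. The names INDEL_HEADER and find_indels are hypothetical. Unlike the pattern in the question, this sketch assumes the sign is always present, and it does not check that the trailing characters are valid bases.

```python
import re

# Matches only the sign and the length; the bases are sliced out afterwards.
INDEL_HEADER = re.compile(r"([+-])(\d+)")


def find_indels(pileup):
    """Return each indel as sign + length + that many following characters."""
    indels = []
    pos = 0
    while True:
        m = INDEL_HEADER.search(pileup, pos)
        if m is None:
            break
        n = int(m.group(2))
        end = m.end() + n  # take exactly n characters after the integer
        indels.append(pileup[m.start():end])
        pos = end  # resume the search after the indel's bases
    return indels


print(find_indels("AtT+3ACGTTT-1AaTTa"))
```

On the string from the question this yields +3ACG and -1A, and on the string from the update it yields -1A, +1G and +4ATCG.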
The simple Perl pattern for matching an integer followed by that number of any character is just:
(\d+)(??{"." x $1})
which is quite straight-forward, I think you’ll agree. For example, this snippet:
my $string = "AtT+3ACGTTT-1AaTTa";
print "Matched $&\n" while $string =~ m{
    ( \d+ )             # capture an integer into $1
    (??{ "." x $1 })    # interpolate that many dots back into pattern
}xg;
merrily prints out the expected
Matched 3ACG
Matched 1A
EDIT
Oh drat, I see you just added the Python tag since I began editing. Oops. Well, maybe this will be helpful to you anyway.
That said, if what you are actually looking for is fuzzy matching where you allow for some number of insertions and deletions (the edit distance), then Matthew Barnett’s regex library for Python will handle that. That doesn’t seem to be quite what you’re doing, as the insertions and deletions are actually represented in your strings.
But Matthew’s library is really very good and very interesting, and it even does many things that Perl cannot do. :) It’s a drop-in replacement for the standard Python re library.

Regex that match when beginning & end is of the same length

How do you make a regex that matches when the beginning and the end are of the same length?
For example
>>> String = '[[A], [[B]], [C], [[D]]]'
>>> Result = re.findall(pattern, String)
>>> Result
['[A]', '[[B]]', '[C]', '[[D]]']
Currently I use the pattern \[.*?\] but it results in
['[[A]', '[[B]', '[C]', '[[D]']
Thanks in advance.
You can define such a regular expression for a finite number of beginning/end characters (ie, something like "if it starts and ends with 1, or starts and ends with 2, or etc"). You, however, cannot do this for an unlimited number of characters. This is simply a fact of regular expressions. Regular expressions are the language of finite-state machines, and finite-state machines cannot do counting; at least the power of a pushdown-automaton (context-free grammar) is needed for that.
Put simply, a regular expression can say: "I saw x and then I saw y" but it cannot say "I saw x and then I saw y the same number of times" because it cannot remember how many times it saw x.
However, you can easily do this using the full power of the Python programming language, which is Turing-complete! Turing-complete languages can definitely do counting:
>>> import re
>>> string = '[[A], [[B]], [C], [[D]]]'
>>> sameBrackets = lambda s: len(re.findall(r'\[', s)) == len(re.findall(r'\]', s))
>>> list(filter(sameBrackets, string.split(", ")))
['[[B]]', '[C]']
You can't. Sorry.
Python's regular expressions are an extension of "finite state automata", which only allow a finite amount of memory to be kept as you scan through the string for a match. This example requires an arbitrary amount of memory, depending on how many repetitions there are.
The only way in which Python allows more than just finite state is with "backreferences", which let you match an identical copy of a previously matched portion of the string -- but they don't allow you to match something with, say, the same number of characters.
You should try writing this by hand, instead.
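A hand-written version could be as simple as tracking the bracket depth while scanning the string. This is only a sketch: the function name top_level_items is made up, and it assumes the wanted items are the bracket groups sitting directly inside the outermost pair, as in the question's example.

```python
def top_level_items(text):
    """Return the balanced [...] groups directly inside the outermost brackets."""
    items = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "[":
            depth += 1
            if depth == 2:  # an item opens one level below the outer list
                start = i
        elif ch == "]":
            if depth == 0:
                raise ValueError("unbalanced ']' at index %d" % i)
            if depth == 2:  # the item that opened at `start` closes here
                items.append(text[start:i + 1])
            depth -= 1
    if depth != 0:
        raise ValueError("unbalanced '['")
    return items


print(top_level_items("[[A], [[B]], [C], [[D]]]"))
```

The depth counter is exactly the unbounded memory that a plain regular expression lacks.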
To match balanced brackets you need a recursive regular expression. The stock re module doesn't support this syntax, but the alternative regex does:
import regex
r = r'\[(([^\[\]]+)|(?R))*\]'
print(regex.match(r, '[[A], [[B]], [C], [[D]] ]')) # ok
print(regex.match(r, '[[A], [[B]], [C , [[D]] ]')) # None
That expression basically says: match something surrounded by brackets, where "something" is either a series of non-brackets ([^\[\]]+) or the whole thing once again (?R).
