Python regex - faster search

I need a way to optimize my regex. Here is the string I am working with:
rr='JA=3262SGF432643;KL=ASDF43TQ;ME=FQEWF43344;JA=4355FF;PE=FDSDFHSDF;EB=SFGDASDSD;JA=THISONE;IH=42DFG43;'
I want to extract only the JA=4355FF entry, which is the one just before JA=THISONE, so I did it this way:
aa='.*JA=([^.]*)JA=THISONE[^.]*'
aa=re.compile(aa)
print (re.findall(aa,rr))
and I get:
['4355FF;PE=FDSDFHSDF;EB=SFGDASDSD;']
My first problem is that the search is slow, because the string I am searching is very large and JA=THISONE is usually near the end of it.
My second problem is that I don't get 4355FF, but everything up to JA=THISONE.
Can someone help me optimize my regex? Thank you!

I. Consider using string search instead of regexes:
thisone_pos = rr.find('JA=THISONE')
range_start = rr.rfind("JA=", 0, thisone_pos) + 3
range_end = rr.find(';', range_start)
print(rr[range_start:range_end])
II. Consider flipping the string and constructing your regex in reverse:
re.findall(pattern, rr[::-1])
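Option II leaves the reversed pattern unspecified. Here is a minimal sketch of what it could look like, assuming every entry is a key=value pair terminated by ';' and no other key ends in "JA". The function name value_before_marker and the pattern itself are illustrative, not part of the original answer:

```python
import re

# The pattern is written backwards because it runs on the reversed string:
# "ENOSIHT=AJ" is "JA=THISONE" reversed, and "=AJ" is "JA=" reversed.
REVERSED_PATTERN = re.compile(r'ENOSIHT=AJ.*?;([^;]*)=AJ(?=;|$)')

def value_before_marker(text):
    """Return the value of the last JA= entry before JA=THISONE, or None."""
    match = REVERSED_PATTERN.search(text[::-1])
    if match is None:
        return None
    # The captured value is reversed as well, so flip it back.
    return match.group(1)[::-1]

rr = 'JA=3262SGF432643;KL=ASDF43TQ;ME=FQEWF43344;JA=4355FF;PE=FDSDFHSDF;EB=SFGDASDSD;JA=THISONE;IH=42DFG43;'
print(value_before_marker(rr))  # 4355FF
```

Because the marker is usually near the end of the original string, it sits near the start of the reversed one, so the search has less text to scan before it finds a candidate.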

You could consider the following solution:
import re
rr='JA=3262SGF432643;KL=ASDF43TQ;ME=FQEWF43344;JA=4355FF;PE=FDSDFHSDF;EB=SFGDASDSD;JA=THISONE;IH=42DFG43;'
m = re.findall( r"(JA=[^;]+;)", rr )
# Print all hits
print(m)
# Print the hit preceding "JA=THISONE;"
print(m[m.index("JA=THISONE;") - 1])
First, you look for all instances starting with "JA=" and then you pick the last instance located before "JA=THISONE;".
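If a single regex is preferred, the question's second problem (capturing everything up to JA=THISONE) can be avoided by only allowing non-JA entries between the captured value and the marker. This is a sketch under the assumption that entries are always terminated by ';' and that no other key ends in "JA"; the pattern is not from the answers above:

```python
import re

rr = 'JA=3262SGF432643;KL=ASDF43TQ;ME=FQEWF43344;JA=4355FF;PE=FDSDFHSDF;EB=SFGDASDSD;JA=THISONE;IH=42DFG43;'

# JA=<value>; followed by any number of entries that do not start with JA=,
# followed by the marker entry itself.
pattern = re.compile(r'JA=([^;]*);(?:(?!JA=)[^;]*;)*JA=THISONE;')

match = pattern.search(rr)
previous_value = match.group(1) if match else None
print(previous_value)  # 4355FF
```

The negative lookahead (?!JA=) is what stops an earlier JA= entry from swallowing a later one.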

Related

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But, it gives wrong output for first string. Also, when I tried another one (here):
(.)(.*?)*?\1
But, this one again gives wrong output. What's going wrong here?
The python code is a print statement:
print(re.sub(r'(.)(.*?)\1', r'\g<2>', s)) # s is the string

It can be solved without regular expression, like below
>>>''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>>''.join([i for i in s if s.count(i) == 1])
'c'

re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)
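To see the difference between one pass and the loop, the same idea can be wrapped in a function. This is only a sketch; the names PAIR and remove_pairs are made up for illustration:

```python
import re

PAIR = re.compile(r'(.)(.*?)\1')

def remove_pairs(s):
    """Apply the substitution repeatedly until the string stops changing."""
    while True:
        new_s = PAIR.sub(r'\g<2>', s)
        if new_s == s:
            return s
        s = new_s

print(PAIR.sub(r'\g<2>', 'abbcabb'))  # single pass: bbc
print(remove_pairs('abbcabb'))        # c
print(remove_pairs('abca'))           # bc
```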

Regular expressions don't seem to be the ideal solution:
they don't handle overlapping matches, so a loop is needed (like in this answer), and the string is rebuilt over and over (performance suffers)
they're overkill here; we just need to count the characters
I like this answer, but calling count repeatedly in a list comprehension rescans the whole string for every character.
It can be solved without regular expression and without O(n**2) complexity, only O(n) using collections.Counter
first count the characters of the string very easily & quickly
then filter the string testing if the count matches using the counter we just created.
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)

EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first group not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack until it finds the longest match for subpattern A that still allows the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.

The site explains it well, hover and use the explanation section.
(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by anything sandwiched in the middle, until that same character is encountered again.
So, for abbcabb the "sandwiched" portion is bbc, between the two a's.
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
    if i not in result:
        result.append(i)
    else:
        result.remove(i)
print(''.join(result))
Note that this keeps each surviving character at the position of its last odd occurrence, not its first.
To keep the first occurrence instead, use a counter as suggested in this answer. Just change the condition to check for odd counts, e.g. count[letter] % 2 == 1.
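A sketch of that counter variant, assuming the goal is to keep one copy of every character with an odd count, in order of first appearance. The function name odd_count_chars is hypothetical:

```python
from collections import Counter

def odd_count_chars(s):
    """Keep one copy of each character that occurs an odd number of times."""
    counts = Counter(s)   # one pass to count every character
    seen = set()
    kept = []
    for ch in s:          # second pass keeps the order of first appearance
        if counts[ch] % 2 == 1 and ch not in seen:
            seen.add(ch)
            kept.append(ch)
    return "".join(kept)

print(odd_count_chars("abbcabb"))  # c
print(odd_count_chars("abca"))     # bc
print(odd_count_chars("abaa"))     # ab
```

For "abaa" the list-based loop above would give "ba", which is the ordering difference the note describes.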

Python: specify regex pattern that divides the string roughly into half and make a tuple

I have a string:
S = 'ABCKFDJRFMDLERKDFLKERWERJF'
I am trying to make a regex pattern that divides the string into half. I believe it's similar to:
word_1 = 'jupiter'
pattern_1 = re.compile('(\w+\s)'+word_1+'(\s\w+)')
But this is much more complicated because I need to first find the place that divides the string into half. What I want to do is:
For a function called divider,
split_S = divider('ABCKFNDNVMCNDSKDE' , 'NVM')
print(split_S)
('ABCKFND', 'CNDSKDE')
I really don't understand what to start first in this situation. If it's difficult to understand my question, please do tell me.

If I understand your problem correctly, you don't even need a regex, just use str.find():
def divider(s, splitter):
    idx = s.find(splitter)
    return s[:idx], s[idx+len(splitter):]
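One caveat with the find() version: str.find returns -1 when the splitter is absent, so the slices silently produce a wrong result. A possible variant using str.partition, with an explicit error for that case (raising ValueError is an assumption about the desired behaviour, not something the question specifies):

```python
def divider(s, splitter):
    """Split s around the first occurrence of splitter, dropping the splitter."""
    head, found, tail = s.partition(splitter)
    if not found:
        # partition returns an empty separator when splitter is not in s
        raise ValueError("splitter not found: %r" % splitter)
    return head, tail

print(divider('ABCKFNDNVMCNDSKDE', 'NVM'))  # ('ABCKFND', 'CNDSKDE')
```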

You can try this:
import re
S = 'ABCKFDJRFMDLERKDFLKERWERJF'
final_string = tuple(re.findall(".{"+str(len(S)//2)+"}", S))
Output:
('ABCKFDJRFMDLE', 'RKDFLKERWERJF')
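Note that the regex above drops the last character when the string has odd length, because only full-size chunks match. If the goal is simply two halves, plain slicing may be enough. A sketch, with the hypothetical name split_in_half:

```python
def split_in_half(s):
    """Return the two halves of s; the second half gets the extra character."""
    mid = len(s) // 2
    return s[:mid], s[mid:]

print(split_in_half('ABCKFDJRFMDLERKDFLKERWERJF'))  # ('ABCKFDJRFMDLE', 'RKDFLKERWERJF')
print(split_in_half('ABCDE'))                       # ('AB', 'CDE')
```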

Best way to convert string to integer in Python

I have a spreadsheet with text values like A067, A002, A104 and want to convert them to integers. What is the most efficient way to do this? Right now I am doing the following:
str = 'A067'
str = str.replace('A','')
n = int(str)
print n

Depending on your data, the following might be suitable:
import string
print(int('A067'.strip(string.ascii_letters)))
Python's strip() method takes a string of characters to be removed from the start and end of a string. By passing string.ascii_letters, it removes any leading and trailing letters from the string.
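Since strip() only touches the two ends of the string, letters in the middle survive and int() then fails. A short sketch of that behaviour (the helper name strip_letters_to_int is made up for illustration):

```python
import string

def strip_letters_to_int(text):
    """Remove leading and trailing ASCII letters, then parse what is left."""
    return int(text.strip(string.ascii_letters))

print(strip_letters_to_int('A067'))    # 67
print(strip_letters_to_int('AB067C'))  # 67

# Letters in the middle are not removed, so int() raises ValueError.
try:
    strip_letters_to_int('A0B67')
except ValueError as exc:
    print('cannot convert:', exc)
```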

If the only non-number part of the input will be the first letter, the fastest way will probably be to slice the string:
s = 'A067'
n = int(s[1:])
print(n)
If you believe that you will find more than one number per string though, the regex answers will most likely be easier to work with.

You could use regular expressions to find numbers.
import re
s = 'A067'
s = re.findall(r'\d+', s) # This will find all numbers in the string
n = int(s[0]) # Take the first number. Note: this raises IndexError if the string has no digits; a simple check avoids this
print(n)
Here's some example output of findall with different strings
>>> a = re.findall(r'\d+', 'A067')
>>> a
['067']
>>> a = re.findall(r'\d+', 'A067 B67')
>>> a
['067', '67']
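The "simple check" mentioned above could look like this. It is a sketch that uses re.search instead of findall, since only the first number is needed; returning None for strings without digits is an assumption about what the caller wants:

```python
import re

def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

print(first_int('A067'))      # 67
print(first_int('A067 B67'))  # 67
print(first_int('ABC'))       # None
```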

You can use a named group with the search method from the re module.
import re
line = 'A067'
regex = re.compile(r"(?P<numbers>\d+)")
matcher = regex.search(line)
if matcher:
    numbers = int(matcher.groupdict()["numbers"]) # this will give you the numbers from the captured group

import string
str = 'A067'
print (int(str.strip(string.ascii_letters)))

Python re finding string between underscore and ext

I have the following string
"1206292WS_R0_ws.shp"
I am trying to re.sub everything except what is between the second "_" and ".shp"
Output would be "ws" in this case.
I have managed to remove the .shp but for the life of me cannot figure out how to get rid of everything before the "_"
epass = "1206292WS_R0_ws.shp"
regex = re.compile(r"(\.shp$)")
x = re.sub(regex, "", epass)
Outputs
1206292WS_R0_ws
Desired output:
ws

You don't really need a regex for this:
print(epass.split("_")[-1].split(".")[0])
>>> timeit.timeit("epass.split(\"_\")[-1].split(\".\")[0]",setup="from __main__ import epass")
0.57268652953933608
>>> timeit.timeit("regex.findall(epass)",setup="from __main__ import epass,regex")
0.59134766185007948
speed seems very similar for both but a tiny bit faster with splits
Actually, by far the fastest method is
print(epass.rsplit("_",1)[-1].split(".")[0])
which takes 3 seconds on a string 100k long (on my system) vs 35+ seconds for either of the other methods.
If you actually mean the second _ and not the last _, then you could do
epass.split("_",2)[-1].split(".")[0]
although depending on where the second _ is, a regex may be just as fast or faster.

The regular expression you describe is ^[^_]*_[^_]*_(.*)[.]shp$
>>> import re
>>> s="1206292WS_R0_ws.shp"
>>> regex=re.compile(r"^[^_]*_[^_]*_(.*)[.]shp$")
>>> x=re.sub(regex,r"\1",s)
>>> print(x)
ws
Note: this is the regular expression as you describe it, not necessarily the best way to solve the actual problem.
everything except what is between the second "_" and ".shp"
Regexplation:
^        # Start of the string
[^_]*    # Any string of characters not containing _
_        # Literal _
[^_]*    # Any string of characters not containing _
_        # Literal _
(        # Start capture group
.*       # Anything
)        # Close capture group
[.]shp   # Literal .shp
$        # End of string

Also, if you don't want a regex, you can use the rfind and find methods:
epass[epass.rfind('_')+1:epass.find('.')]

Perhaps _([^_]+)\.shp$ will do the job?
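A quick sketch of how that pattern could be used with re.search rather than re.sub. The variable name name and the None fallback are illustrative choices:

```python
import re

epass = "1206292WS_R0_ws.shp"

# Capture the run of non-underscore characters between the last _ and .shp
match = re.search(r'_([^_]+)\.shp$', epass)
name = match.group(1) if match else None
print(name)  # ws
```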

Simple version with RE
import re
re_f=re.compile(r'^.*_')
re_b=re.compile(r'\..*')
inp = "1206292WS_R0_ws.shp"
out = re_f.sub('',inp)
out = re_b.sub('',out)
print(out)
ws

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any input and should take into account spaces and any other weird thing that might appear in the long string (this should be possible with regex, right?). Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same, and so is someDomainName. But there is another http://www. in the string.

import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
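For illustration, here is how that pattern behaves on a string that also contains another http://www. address, as described in the question's update. The sample text and the otherSite.org address are invented for this sketch:

```python
import re

URL_PATTERN = re.compile(r'\bhttp://www\.someDomainName\.com/\d+\b')

# Hypothetical input containing a second, unrelated http://www. address
long_string = ('intro http://www.otherSite.org/page text '
               'http://www.someDomainName.com/2134 and more text')

results = URL_PATTERN.findall(long_string)
print(results)  # ['http://www.someDomainName.com/2134']
```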

>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
In the above pattern, we have captured 5 groups -
One is the complete string that is matched
Rest are in the order of the brackets you see.. (So, you are looking for the second one..) - (\\w*)
If you want, you can capture only the part of the string you are interested in.. So, you can remove the brackets from rest of the pattern that you don't want and just keep (\w*)
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the above example, you won't have groups - 2, 3 and 4, as in the previous example, as we have captured only 1 group.. And yes group 0 is always captured.. That is the complete string that matches..

Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()

If you are sure that there are no dots in SomeDomainName, you can just find the first occurrence of the string ".com/" and take everything from that index on.
This avoids regexes, which are harder to maintain:
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/')+5:])
