Python re finding string between underscore and ext - python

I have the following string
"1206292WS_R0_ws.shp"
I am trying to re.sub everything except what is between the second "_" and ".shp"
Output would be "ws" in this case.
I have managed to remove the .shp but for the life of me cannot figure out how to get rid of everything before the "_"
epass = "1206292WS_R0_ws.shp"
regex = re.compile(r"(\.shp$)")
x = re.sub(regex, "", epass)
Outputs
1206292WS_R0_ws
Desired output:
ws

You don't really need a regex for this:
print epass.split("_")[-1].split(".")[0]
>>> timeit.timeit("epass.split(\"_\")[-1].split(\".\")[0]",setup="from __main__ import epass")
0.57268652953933608
>>> timeit.timeit("regex.findall(epass)",setup="from __main__ import epass,regex")
0.59134766185007948
Speed seems very similar for both, with the split version a tiny bit faster.
Actually, by far the fastest method is
print epass.rsplit("_",1)[-1].split(".")[0]
which takes 3 seconds on a string 100k long (on my system) vs. 35+ seconds for either of the other methods.
If you actually mean the second _ and not the last _, then you could do
epass.split("_",2)[-1].split(".")[0]
although, depending on where the second _ is, a regex may be just as fast or faster.
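The "last underscore" and "second underscore" variants only differ when the name contains more than two underscores. A minimal sketch of the difference, using a made-up second filename (`other`) and hypothetical helper names that are not part of the original answer:

```python
epass = "1206292WS_R0_ws.shp"
other = "1206292WS_R0_ws_extra.shp"  # hypothetical name with a third underscore

def after_last(name):
    # Split once from the right, so only the last "_" matters.
    return name.rsplit("_", 1)[-1].split(".")[0]

def after_second(name):
    # Split at most twice from the left, so everything after the second "_" is kept.
    return name.split("_", 2)[-1].split(".")[0]

print(after_last(epass), after_second(epass))  # ws ws
print(after_last(other), after_second(other))  # extra ws_extra
```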

The regular expression you describe is ^[^_]*_[^_]*_(.*)[.]shp$
>>> import re
>>> s="1206292WS_R0_ws.shp"
>>> regex=re.compile(r"^[^_]*_[^_]*_(.*)[.]shp$")
>>> x=re.sub(regex,r"\1",s)
>>> print x
ws
Note: this is the regular expression as you describe it, i.e. everything except what is between the second "_" and ".shp". It is not necessarily the best way to solve the actual problem.
Regexplation:
^ # Start of the string
[^_]* # Any string of characters not containing _
_ # Literal _ (the first underscore)
[^_]* # Any string of characters not containing _
_ # Literal _ (the second underscore)
( # Start capture group
.* # Anything
) # Close capture group
[.]shp # Literal .shp
$ # End of string
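The same breakdown can be kept inside the code by compiling the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of that option; the pattern itself is the one from the answer above.

```python
import re

regex = re.compile(r"""
    ^        # Start of the string
    [^_]*    # Any run of characters other than _
    _        # Literal underscore (the first one)
    [^_]*    # Any run of characters other than _
    _        # Literal underscore (the second one)
    (.*)     # Capture group: anything
    [.]shp   # Literal .shp
    $        # End of string
""", re.VERBOSE)

print(regex.sub(r"\1", "1206292WS_R0_ws.shp"))  # ws
```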

Also, if you don't want a regex, you can use the rfind and find methods:
epass[epass.rfind('_')+1:epass.find('.')]

Perhaps _([^_]+)\.shp$ will do the job?
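One way to try this pattern is with re.search, which returns the captured group directly instead of substituting away the rest. A minimal sketch; the helper name `last_field` is made up for illustration. Note that this pattern anchors on the last underscore, which happens to be the second one in the sample filename.

```python
import re

# Capture the run of non-underscore characters between the last "_" and ".shp".
PATTERN = re.compile(r"_([^_]+)\.shp$")

def last_field(filename):
    """Return the text between the last underscore and '.shp', or None."""
    match = PATTERN.search(filename)
    return match.group(1) if match else None

print(last_field("1206292WS_R0_ws.shp"))  # ws
print(last_field("no_extension"))         # None
```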

Simple version with RE
import re
re_f=re.compile(r'^.*_')
re_b=re.compile(r'\..*')
inp = "1206292WS_R0_ws.shp"
out = re_f.sub('',inp)
out = re_b.sub('',out)
print out
ws

Related

Remove all characters from string after the first digit with the least amount of code

So far, to remove all characters from a string after the first digit, I came up with
import re
s1 = "thishasadigit4here"
m = re.search(r"\d", s1)
result = s1[:m.start()]
print(result)
thishasadigit
Is there a compact way of coding this task?
As mentioned in the comments shorter code does not always imply better code (although it can be fun to go golfing).
That said (with re imported and mystring having your string), how about this:
result = re.split(r"\d", mystring, maxsplit=1)[0]
See https://pynative.com/python-regex-split/ for more information.
If it is just a digit you are trying to find, you probably don't need a regex -
import string
s1 = "thishasadigit4here"
min_digit_index = min(s1.find(_) for _ in string.digits if s1.find(_) > -1)
s1[:min_digit_index]
# 'thishasadigit'
All of that can be reduced to a single line -
import string
s1[:min(s1.find(_) for _ in string.digits if s1.find(_) > -1)]
# 'thishasadigit'
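The two approaches behave differently when the string contains no digit at all: min() over an empty sequence raises a ValueError, while re.split simply returns the whole string. A small sketch of the re.split variant wrapped in a function (the name `before_first_digit` is hypothetical):

```python
import re

def before_first_digit(text):
    """Return the part of text that precedes its first digit."""
    # maxsplit=1 stops after the first digit; with no digit at all the
    # whole string comes back as the only element of the list.
    return re.split(r"\d", text, maxsplit=1)[0]

print(before_first_digit("thishasadigit4here"))  # thishasadigit
print(before_first_digit("nodigits"))            # nodigits
```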

Best way to convert string to integer in Python

I have a spreadsheet with text values like A067, A002, A104. What is the most efficient way to convert them to integers? Right now I am doing the following:
str = 'A067'
str = str.replace('A','')
n = int(str)
print n
Depending on your data, the following might be suitable:
import string
print int('A067'.strip(string.ascii_letters))
Python's strip() method takes a string of characters to be removed from the start and end of a string. By passing string.ascii_letters, it removes any leading and trailing letters from the string.
If the only non-number part of the input will be the first letter, the fastest way will probably be to slice the string:
s = 'A067'
n = int(s[1:])
print n
If you believe that you will find more than one number per string though, the regex-based answers will most likely be easier to work with.
You could use regular expressions to find numbers.
import re
s = 'A067'
s = re.findall(r'\d+', s) # This will find all numbers in the string
n = int(s[0]) # This will get the first number. Note: if there are no numbers, s[0] raises an IndexError; a simple check avoids this
print n
Here's some example output of findall with different strings
>>> a = re.findall(r'\d+', 'A067')
>>> a
['067']
>>> a = re.findall(r'\d+', 'A067 B67')
>>> a
['067', '67']
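The "simple check" mentioned above could look like the following sketch. It uses re.search rather than re.findall because only the first number is needed; the function name `first_int` and the `default` parameter are assumptions made for this example.

```python
import re

def first_int(text, default=None):
    """Return the first run of digits in text as an int, or default if none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default

print(first_int("A067"))      # 67
print(first_int("A067 B67"))  # 67
print(first_int("ABC"))       # None
```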
You can use the search method of a compiled regex from the re module.
import re
line = 'A067'
regex = re.compile(r"(?P<numbers>\d+)")
matcher = regex.search(line)
if matcher:
    numbers = int(matcher.groupdict()["numbers"]) # this will give you the number from the captured group
import string
str = 'A067'
print (int(str.strip(string.ascii_letters)))

Python regex - faster search

I need a way to optimize my regex. Here is the string I am working with:
rr='JA=3262SGF432643;KL=ASDF43TQ;ME=FQEWF43344;JA=4355FF;PE=FDSDFHSDF;EB=SFGDASDSD;JA=THISONE;IH=42DFG43;'
and I want to take only JA=4355FF, which is just before JA=THISONE, so I did it this way:
aa='.*JA=([^.]*)JA=THISONE[^.]*'
aa=re.compile(aa)
print (re.findall(aa,rr))
and I get:
['4355FF;PE=FDSDFHSDF;EB=SFGDASDSD;']
My first problem is that finding the appropriate part of the string is slow (because the string I want to search is very large and JA=THISONE is usually at the end of it).
My second problem is that I don't get 4355FF but the whole string up to JA=THISONE.
Can someone help me optimize my regex? Thank you!
I. Consider using string search instead of regexes:
thisone_pos = rr.find('JA=THISONE')
range_start = rr.rfind("JA=", 0, thisone_pos) + 3
range_end = rr.find(';', range_start)
print rr[range_start:range_end]
II. Consider flipping the string and constructing your regex in reverse:
re.findall(pattern, rr[::-1])
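The second idea does not spell out a pattern, so here is one possible reversed pattern as a sketch. It is an assumption about how the reversal could be done, not necessarily what the answer had in mind, and whether it is actually faster on a large input would need to be measured.

```python
import re

rr = ('JA=3262SGF432643;KL=ASDF43TQ;ME=FQEWF43344;JA=4355FF;'
      'PE=FDSDFHSDF;EB=SFGDASDSD;JA=THISONE;IH=42DFG43;')

# Everything is written backwards: "JA=THISONE" becomes "ENOSIHT=AJ" and the
# wanted "JA=<value>" becomes "<value reversed>=AJ". The lazy .*? makes the
# search stop at the first "=AJ" that follows, i.e. the nearest preceding
# JA= field in the original string.
reversed_pattern = re.compile(r"ENOSIHT=AJ.*?;([^;]*)=AJ")

match = reversed_pattern.search(rr[::-1])
value = match.group(1)[::-1] if match else None  # flip the capture back
print(value)  # 4355FF
```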
You could consider the following solution:
import re
rr='JA=3262SGF432643;KL=ASDF43TQ;ME=FQEWF43344;JA=4355FF;PE=FDSDFHSDF;EB=SFGDASDSD;JA=THISONE;IH=42DFG43;'
m = re.findall( r"(JA=[^;]+;)", rr )
# Print all hits
print m
# Print the hit preceding "JA=THISONE;"
print m[ m.index( "JA=THISONE;" ) - 1]
First, you look for all instances of the form "JA=...;" and then you pick the instance located immediately before "JA=THISONE;".

Python complex regex replace

I'm trying to write a simple VB6-to-C translator to help me port an open source game to C.
I want to be able to get "Npclist[NpcIndex]" from "With Npclist[NpcIndex]" using regex and to replace it everywhere it has to be replaced. ("With" is used as a macro in VB6 that adds Npclist[NpcIndex] wherever it is needed until it finds "End With".)
Example:
With Npclist[NpcIndex]
.goTo(245) <-- it should be replaced with Npclist[NpcIndex].goTo(245)
End With
Is it possible to use regex to do the job?
I've tried using a function to perform another regex replace between the "With" and the "End With", but I can't know the text the "With" is replacing (Npclist[NpcIndex]).
Thanks in advance
I personally wouldn't trust any single-regex solution to get it right the first time, nor would I feel like debugging it. Instead, I would parse the code line by line and cache any With expression, using it to replace any . directly preceded by whitespace or by any type of opening bracket (add use cases as needed):
(?<=[\s[({])\. - positive lookbehind for any character from the set + escaped literal dot
(?:(?<=[\s[({])|^)\. - use this non-capturing alternatives list if to-be-replaced . can occur on the beginning of line
import re
def convert_vb_to_c(vb_code_lines):
    c_code = []
    current_with = ""
    for line in vb_code_lines:
        if re.search(r'^\s*With', line) is not None:
            # remember the With expression, e.g. "Npclist[NpcIndex]."
            current_with = line.strip()[5:] + "."
            continue
        elif re.search(r'^\s*End With', line) is not None:
            current_with = "{error_outside_with_replacement}"
            continue
        line = re.sub(r'(?<=[\s[({])\.', current_with, line)
        c_code.append(line)
    return "\n".join(c_code)
example = """
With Npclist[NpcIndex]
    .goTo(245)
End With
With hatla
    .matla.tatla[.matla.other] = .matla.other2
    dont.mind.me(.do.mind.me)
    .next()
End With
"""
# use file_object.readlines() in real life
print(convert_vb_to_c(example.split("\n")))
You can pass a function to the sub method:
# just to give the idea of the regex
regex = re.compile(r'''With (.+)
(the-regex-for-the-VB-expression)+?
End With''')
def repl(match):
    beginning = match.group(1) # NpcList[NpcIndex] in your example
    return ''.join(beginning + line for line in match.group(2).splitlines())
re.sub(regex, repl, the_string)
In repl you can obtain all the information about the matching from the match object, build whichever string you want and return it. The matched string will be replaced by the string you return.
Note that you must be really careful when writing the regex above. In particular, using (.+) as I did matches the whole line up to, but excluding, the newline, which may or may not be what you want (I don't know VB, and I have no idea which regex could go there instead to catch only what you want).
The same goes for the (the-regex-for-the-VB-expression)+?. I have no idea what code could be in those lines, hence I leave the details of implementing it to you. Maybe taking the whole line is okay, but I wouldn't trust something this simple (expressions can probably span multiple lines, right?).
Also doing all in one big regular expression is, in general, error prone and slow.
I'd strongly consider regexes only to find With and End With and use something else to do the replacements.
This may do what you need in Python 2.7. I'm assuming you want to strip out the With and End With, right? You don't need those in C.
>>> import re
>>> search_text = """
... With Np1clist[Npc1Index]
... .comeFrom(543)
... End With
...
... With Npc2list[Npc2Index]
... .goTo(245)
... End With"""
>>>
>>> def f(m):
...     return '{0}{1}({2})'.format(m.group(1), m.group(2), m.group(3))
...
>>> regex = r'With\s+([^\s]*)\s*(\.[^(]+)\(([^)]+)\)[^\n]*\nEnd With'
>>> print re.sub(regex, f, search_text)
Np1clist[Npc1Index].comeFrom(543)
Npc2list[Npc2Index].goTo(245)

match regular expression where string to match is build from variables

I am having a problem. I am trying to match only the 2nd file.
ERIC_KM_NOW_SYSTEMIC_17001900_data.html
ERIC_KM_NOW_17001900_data.html
import re
viewTag = "KM_NOW"
regex = re.escape(viewTag) + r'(\d{8})' + re.escape('_data')
test = re.search(regex, "ERIC_KM_NOW_17001900_data.html")
print(test)
is that not correct?
I get type 'None'
You forgot a _ after KM_NOW.
(Hint: print(regex) to see it easily next time. ;-))
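With the missing underscore added, the pattern matches only the second filename. A short sketch using the names from the question (the loop over both filenames is added here for illustration):

```python
import re

viewTag = "KM_NOW"
# The underscore between the tag and the eight digits was missing.
regex = re.escape(viewTag) + r'_(\d{8})' + re.escape('_data')

for name in ("ERIC_KM_NOW_SYSTEMIC_17001900_data.html",
             "ERIC_KM_NOW_17001900_data.html"):
    match = re.search(regex, name)
    # Prints None for the SYSTEMIC file and 17001900 for the other one.
    print(name, match.group(1) if match else None)
```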
