Extract substring from a python string

I want to extract the part of the string that comes before the 9-digit number below:
tmp = "place1_128017000_gw_cl_mask.tif"
The output should be place1.
I could do this:
tmp.split('_')[0]
but I also want the solution to work for:
tmp = "place1_place2_128017000_gw_cl_mask.tif"
where the result would be:
place1_place2
You can assume that the number will always be 9 digits long.

Using a regular expression with a lookahead, this is a simple solution:
import re
tmp = "place1_place2_128017000_gw_cl_mask.tif"
m = re.search(r'.+(?=_\d{9}_)', tmp)
print(m.group())
Result:
place1_place2
Note that \d{9} matches exactly 9 digits. The part of the regex inside (?= ... ) is a lookahead: it is not included in the match itself, but the match only succeeds if the lookahead pattern follows it.
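As a quick check of the lookahead approach against both file names from the question, here is a minimal sketch. The helper name prefix_before_number is hypothetical (it is not part of the answer above), and returning None when no 9-digit block is present is an assumption about the desired behaviour; re.search itself returns None in that case.

```python
import re

def prefix_before_number(name):
    """Return the text before '_<9 digits>_', or None if there is no such block."""
    m = re.search(r'.+(?=_\d{9}_)', name)
    return m.group() if m else None

print(prefix_before_number("place1_128017000_gw_cl_mask.tif"))
print(prefix_before_number("place1_place2_128017000_gw_cl_mask.tif"))
print(prefix_before_number("no_number_here.tif"))
```

Guarding against None avoids the AttributeError that m.group() would raise on a file name without the expected number.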

Assuming we can phrase your problem as wanting the substring up to, but not including, the underscore that is followed by a block consisting only of digits, we can try:
import re
tmp = "place1_place2_128017000_gw_cl_mask.tif"
m = re.search(r'^([^_]+(?:_[^_]+)*)_\d+_', tmp)
print(m.group(1)) # place1_place2

Use a regular expression:
import re
places = (
    "place1_128017000_gw_cl_mask.tif",
    "place1_place2_128017000_gw_cl_mask.tif",
)
pattern = re.compile(r"(place\d+(?:_place\d+)*)_\d{9}")
for p in places:
    matched = pattern.match(p)
    if matched:
        print(matched.group(1))
prints:
place1
place1_place2
The regex works like this (adjust as needed, e.g., for fewer than 9 digits or a variable number of digits):
( starts a capture
place\d+ matches "place" followed by one or more digits
(?: starts a group, but does not capture it (no need to capture)
_place\d+ matches an underscore and a further "place" followed by digits
) closes the group
* means zero or more repetitions of the previous group
) closes the capture
_\d{9} matches an underscore followed by 9 digits
The result is in the first (and only) capture group.
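The breakdown above can also live inside the pattern itself. The following is a sketch using re.VERBOSE, which lets a pattern carry whitespace and # comments; the regex is the same as in the answer, only the layout differs.

```python
import re

pattern = re.compile(r"""
    (                   # start capture group 1
      place\d+          # 'place' followed by one or more digits
      (?:_place\d+)*    # zero or more further '_placeN' parts (not captured)
    )                   # end capture group 1
    _\d{9}              # an underscore, then nine digits
""", re.VERBOSE)

m = pattern.match("place1_place2_128017000_gw_cl_mask.tif")
print(m.group(1))
```

Keeping the explanation next to each piece makes it harder for the comments and the pattern to drift apart when the regex is adjusted later.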

Here's a possible solution without regex (unoptimized!):
def extract(s):
    result = ''
    for x in s.split('_'):
        try:
            x = int(x)
        except ValueError:
            pass  # not a number, keep it as a string
        if isinstance(x, int) and len(str(x)) == 9:
            return result[:-1]  # drop the trailing underscore
        else:
            result += x + '_'
tmp = 'place1_128017000_gw_cl_mask.tif'
tmp2 = 'place1_place2_128017000_gw_cl_mask.tif'
print(extract(tmp)) # place1
print(extract(tmp2)) # place1_place2
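A more compact variant of the same no-regex idea is sketched below. The name extract_prefix is hypothetical, and returning None when no 9-digit piece exists is an assumption. It tests each piece with str.isdigit() instead of converting to int, so a piece with leading zeros still counts as nine digits.

```python
def extract_prefix(name):
    """Join the underscore-separated pieces that come before the 9-digit piece."""
    parts = name.split('_')
    for i, part in enumerate(parts):
        if len(part) == 9 and part.isdigit():
            return '_'.join(parts[:i])
    return None  # assumption: signal "not found" with None

print(extract_prefix('place1_128017000_gw_cl_mask.tif'))
print(extract_prefix('place1_place2_128017000_gw_cl_mask.tif'))
```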

Related

Regex Python with min a letter, a number and min a non-alphanumeric character

I would like to check whether a string contains at least 12 characters, at least one letter, at least one digit, and at least one non-alphanumeric character.
I am trying to write a regex for this, but it does not do what I expect.
Here is the regex:
import re
regex = re.compile(r'([A-Za-z]+[0-9]+\W+){12,}')
def is_valid(string):
    return re.fullmatch(regex, string) is not None
test_string = "abdfjhfl58425!!"
print(is_valid(test_string))
When the string contains numbers after letters, it does not match!
Could you help me? Thank you.

Your regex is wrong: {12,} requires the whole letters-digits-symbols sequence to repeat 12 or more times, rather than requiring 12 characters. I found the following on another post, which describes a different but very similar scenario.
You can tweak that regex so that it reads like this:
^(.{0,11}|[^a-zA-Z]{1,}|[^\d]{1,}|[^\W]{1,})$|[\s]
This regex matches only when the password is invalid: it matches strings shorter than 12 characters, strings with no letter, strings with no digit, and strings made up only of word characters. So no match means the password is valid, and a match means it is invalid. You will need to alter the code to suit (note the "is None" check below).
The final working code would then be (with extra tests):
import re
regex = re.compile(r'^(.{0,11}|[^a-zA-Z]{1,}|[^\d]{1,}|[^\W]{1,})$|[\s]')
def is_valid(string):
    return re.fullmatch(regex, string) is None
test_string = "abdfl58425B!!"
print(is_valid(test_string))
test_string = "ABRER58425B!!"
print(is_valid(test_string))
test_string = "eruaso58425!!"
print(is_valid(test_string))

Regex is not really suited to this task as it involves remembering counts of each type of character. You could construct a regex to do it but it would end up being very long and unreadable. Much simpler to write a function to count the number of occurrences of each type of character, something like:
def is_valid(test_string):
    if len(test_string) >= 12 \
            and len([c for c in test_string if c.isalpha()]) >= 1 \
            and len([c for c in test_string if c.isnumeric()]) >= 1 \
            and len([c for c in test_string if not c.isalnum()]) >= 1:
        return True
    else:
        return False
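Since only "at least one" of each kind is needed, the counting can be replaced by any(), which stops at the first hit. This is a sketch of the same checks, not a different rule set; it uses str.isdigit() for the digit test, which is an assumption about what "a number" should mean here.

```python
def is_valid(s):
    """At least 12 characters, one letter, one digit and one non-alphanumeric character."""
    return (
        len(s) >= 12
        and any(c.isalpha() for c in s)
        and any(c.isdigit() for c in s)
        and any(not c.isalnum() for c in s)
    )

print(is_valid("abdfjhfl58425!!"))
print(is_valid("abdfjhfl58425"))
```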

If it helps: to do the same thing without regex, you can use this function:
def is_strong_password(a_string):
    if len(a_string) >= 12:
        chiffre = 0  # digits
        lettre = 0  # letters
        alnum = 0  # non-alphanumeric characters
        for x in a_string:
            if x.isalpha():
                lettre += 1
            if x.isdigit():
                chiffre += 1
            if not x.isalnum():
                alnum += 1
        # at least one of each kind is required
        if lettre >= 1 and chiffre >= 1 and alnum >= 1:
            return True
        else:
            return False
    else:
        return False

You could use four positive lookaheads:
(?i)(?=.{12})(?=.*[a-z])(?=.*\d)(?=.*[^a-z\d])
(?i) specifies that matching is case-insensitive.
The four positive lookaheads are as follows:
(?=.{12}) # assert that the string contains (at least) 12 characters
(?=.*[a-z]) # assert that the string contains a letter
(?=.*\d) # assert that the string contains a digit
(?=.*[^a-z\d]) # assert that the string contains a non-alphanumeric character
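To use these lookaheads from Python, one option is sketched below. It passes re.IGNORECASE instead of the inline (?i), which is equivalent here, and it uses match() so that every lookahead is evaluated from the start of the string. The names STRONG and is_valid are illustrative.

```python
import re

STRONG = re.compile(r'(?=.{12})(?=.*[a-z])(?=.*\d)(?=.*[^a-z\d])', re.IGNORECASE)

def is_valid(s):
    # match() anchors at position 0, so each lookahead scans the whole string
    return STRONG.match(s) is not None

print(is_valid("abdfjhfl58425!!"))
print(is_valid("ABCDEFGHIJKL"))
```

Note that . does not match a newline by default, and that whitespace counts as a non-alphanumeric character under this pattern.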

Replace subtext of a word

I want to turn this string
ramesh#gmail.com
into
rxxxxh#gxxxl.com
This is what I have done so far:
print(re.sub(r'([A-Za-z](.*)[A-Za-z]#)', 'x', i))

One way is to use capturing groups and, in the replacement, emit one x per character of each group that should be masked.
For the second and the fourth group, use a negated character class [^...], which matches any character except those listed.
\b([A-Za-z])([^#\s]*)([A-Za-z]#[A-Za-z])([^#\s.]*)([A-Za-z])\b
For example
import re
i = "ramesh#gmail.com"
res = re.sub(
    r'\b([A-Za-z])([^#\s]*)([A-Za-z]#[A-Za-z])([^#\s.]*)([A-Za-z])\b',
    lambda x: x.group(1) + "x" * len(x.group(2)) + x.group(3) + "x" * len(x.group(4)) + x.group(5),
    i)
print(res)
Output
rxxxxh#gxxxl.com
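For comparison, the same masking can be sketched without a regex by splitting the string on # and on the first dot. The helper names mask_inner and mask_address are hypothetical, and the sketch assumes the input has the shape name#domain.tld as in the question.

```python
def mask_inner(word):
    """Keep the first and last character, replace everything between with 'x'."""
    if len(word) <= 2:
        return word
    return word[0] + "x" * (len(word) - 2) + word[-1]

def mask_address(address):
    # Split into the part before '#', the separator, and the rest;
    # then split the rest at the first '.' so the suffix stays untouched.
    local, sep, rest = address.partition("#")
    domain, dot, tail = rest.partition(".")
    return mask_inner(local) + sep + mask_inner(domain) + dot + tail

print(mask_address("ramesh#gmail.com"))  # rxxxxh#gxxxl.com
```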

How to use regex to only keep first n repeated words

If I have an input sentence
input = 'ok ok, it is very very very very very hard'
and I want to keep only the first three occurrences of any repeated word:
output = 'ok ok, it is very very very hard'
How can I achieve this with the re or regex module in Python?

One option is to use capturing groups with a backreference and use the outer group in the replacement.
((\w+)(?: \2){2})(?: \2)*
Explanation
( Capture group 1
(\w+) Capture group 2, matching 1+ word characters (the example data only uses word characters; to make sure the match is not part of a larger word, add a word boundary \b)
(?: \2){2} Repeat 2 times matching a space and a backreference to group 2. Instead of a single space you could use [ \t]+ to match 1+ spaces or tabs, or \s+ to match 1+ whitespace characters (note that \s would also match a newline)
) Close group 1
(?: \2)* Match 0+ times a space and a backreference to group 2, which consumes the extra repetitions that you want to remove
For example
import re
regex = r"((\w+)(?: \2){2})(?: \2)*"
s = "ok ok, it is very very very very very hard"
result = re.sub(regex, r"\1", s)
if result:
    print(result)
Result
ok ok, it is very very very hard
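To keep the first n occurrences for an arbitrary n, the repetition count in the pattern can be filled in at run time. This is a sketch: the function name keep_first_n is hypothetical, and word boundaries (\b) are added so that a backreference cannot match only the beginning of a longer word.

```python
import re

def keep_first_n(text, n=3):
    """Collapse runs of a repeated word (separated by single spaces) to n copies."""
    # Group 2 is the word; group 1 is the word plus the (n - 1) repeats to keep.
    pattern = r'\b((\w+)(?: \2\b){%d})(?: \2\b)*' % (n - 1)
    return re.sub(pattern, r'\1', text)

s = 'ok ok, it is very very very very very hard'
print(keep_first_n(s))       # ok ok, it is very very very hard
print(keep_first_n(s, n=2))  # ok ok, it is very very hard
```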

You can group a word and use a backreference to it to check that it repeats more than 3 times:
import re
input = 'ok ok, it is very very very very very hard'
print(re.sub(r'\b((\w+)(?:\s+\2){2})(?:\s+\2)+\b', r'\1', input))
This outputs:
ok ok, it is very very very hard

One solution with re.sub and a custom function:
s = 'ok ok, it is very very very very very hard'
def replace(n=3):
    last_word, cnt = '', 0
    current_word = yield
    while True:
        if last_word == current_word:
            cnt += 1
        else:
            cnt = 0
        last_word = current_word
        if cnt >= n:
            current_word = yield ''
        else:
            current_word = yield current_word
import re
replacer = replace()
next(replacer)
print(re.sub(r'\s*[\w]+\s*', lambda g: replacer.send(g.group(0)), s))
Prints:
ok ok, it is very very very hard

How to find the multiple instances of a data between two special characters in python

I am a beginner in Python, so please excuse me if my question is too simple. I want to find every piece of data that sits between two special characters in a string, and also count how many there are. So far I have the following code.
import re
count=0
myString="abcde(fghi)defggdfsidf(ijkl)gfders(gkjh)hgstfvd"
startString = '('
endString = ')'
for item in myString:
    portString=myString[myString.find(startString)+len(startString):myString.find(endString)]
    print(portString)
    count=count+1
My desired output is
fghi
ijkl
gkjh
But my code always starts searching from the beginning of the string and so keeps producing fghi. Can anyone tell me what the problem is?

You can use a non-greedy regex:
import re
count = 0
myString="abcde(fghi)defggdfsidf(ijkl)gfders(gkjh)hgstfvd"
rx = re.compile(r'\((.*?)\)') # non greedy version inside parens
pos = 0
while True:
    m = rx.search(myString[pos:]) # search starting at pos (initially 0)
    if m is None: break
    count += 1
    print(m.group(1))
    pos += m.end() # next search will start past last ')'
The above solution only makes sense if the parentheses are correctly balanced, or if you want each match to start at the first opening parenthesis and end at the next closing one.
If you want to select only parenthesized text that itself contains no opening or closing parentheses, you have to specify that in the regex:
count = 0
myString="abcde(fghi)defg(gdfsidf(ijkl)g(fders(gkjh)hgstfvd"
rx = re.compile(r'\(([^()]*)\)')
pos = 0
while True:
    m = rx.search(myString[pos:]) # search starting at pos (initially 0)
    if m is None: break
    count += 1
    print(m.group(1))
    pos += m.end() # next search will start past last ')'
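The manual position bookkeeping can be left to re.finditer, which yields successive non-overlapping matches. The following sketch reuses the stricter pattern and the unbalanced test string from above; the variable name items is illustrative.

```python
import re

myString = "abcde(fghi)defg(gdfsidf(ijkl)g(fders(gkjh)hgstfvd"
rx = re.compile(r'\(([^()]*)\)')

# finditer resumes after each match, so no slicing or offset arithmetic is needed
items = [m.group(1) for m in rx.finditer(myString)]
count = len(items)

for item in items:
    print(item)
print(count)
```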

As an alternative to regex, if you'd prefer to keep the loop, note that str.find() takes an optional second argument telling it where to start looking. Just keep track of where the closing parenthesis is and start again from just after it.
Unfortunately it's not quite that simple, as the loop condition has to change too, so that it stops after the last set of parentheses.
Something like this should do the trick:
count=0
myString="abcde(fghi)defggdfsidf(ijkl)gfders(gkjh)hgstfvd"
startString = '('
endString = ')'
endStringIndex = -1  # so that the first search starts at index 0
while True:
    startStringIndex = myString.find(startString, endStringIndex+1)
    if startStringIndex == -1:
        break  # no more opening parentheses
    # look for the closing parenthesis after the opening one
    endStringIndex = myString.find(endString, startStringIndex+len(startString))
    if endStringIndex == -1:
        break  # unmatched opening parenthesis
    portString=myString[startStringIndex+len(startString):endStringIndex]
    print(portString)
    count+=1
Output:
fghi
ijkl
gkjh

You can use re.findall:
>>> import re
>>> myString = "abcde(fghi)defggdfsidf(ijkl)gfders(gkjh)hgstfvd"
>>> matches = re.findall(r'\((\w+)\)', myString)
>>> count = len(matches)
>>> print('\n'.join(matches))
fghi
ijkl
gkjh
>>> print(count)
3

clear and comprehensible way to calculate the string [12:3]

I am new to Python.
I have the string "[12:3]" and I want to calculate the difference between the two numbers.
Ex: 12 - 3 = 9
Of course I can do something (not very clear) like this:
def difference(s):
    num1 = []
    num2 = []
    dot = 0
    # find the ':' sign
    for i in range(len(s)):
        if s[i] == ':':
            dot = i
    # left side (skipping the opening '[')
    for i in range(1, dot):
        num1.append(s[i])
    # right side (skipping the closing ']')
    for i in range(dot + 1, len(s) - 1):
        num2.append(s[i])
    return str(int("".join(num1)) - int("".join(num2)))
print(difference('[12:3]'))
But I'm sure there is a clearer and more comprehensible way.
Thanks!

You could use regex to pick the numbers out of your string:
import re
s = '[12:3]'
numbers = [int(x) for x in re.findall(r'\d+',s)]
print(numbers[0] - numbers[1])

Or, without re:
s = '[12:3]'
numbers = [int(x) for x in s.strip('[]').split(':')]
print(numbers[0] - numbers[1])
prints
9

You should use regular expressions.
>>> import re
>>> match = re.match(r'\[(\d+):(\d+)\]', '[12:3]')
>>> match.groups()
('12', '3')
>>> a = int(match.groups()[0])
>>> b = int(match.groups()[1])
>>> a - b
9
The regular expression there says "match starting at the beginning of the string, find [, then one or more digits \d+ (and store them), then a :, then one or more digits \d+ (and store them), and finally ]". We then extract the stored digits using .groups() and do arithmetic on them.
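Wrapped in a function, the same pattern can also reject input that does not have the expected shape. This is a sketch: the name bracket_difference and the choice to raise ValueError are assumptions, and re.fullmatch is used so that trailing characters after ] are not silently accepted.

```python
import re

def bracket_difference(s):
    """Return a - b for a string of the form '[a:b]'."""
    m = re.fullmatch(r'\[(\d+):(\d+)\]', s)
    if m is None:
        raise ValueError("expected a string like '[12:3]', got %r" % s)
    a, b = map(int, m.groups())
    return a - b

print(bracket_difference('[12:3]'))  # 9
```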
