Python: how to replace content in the capture group of a regex?

-abc1234567-abc.jpg
I wish to remove -abc before .jpg, and get -abc1234567.jpg. I tried re.sub(r'\d(-abc).jpg$', '', string), but it will also replace contents outside of the capture group, and give me -abc123456. Is it possible to only replace the content in the capture group i.e. '-abc'?

One solution is to use positive lookahead as follows.
import re
p = re.compile(r'(\-abc)(?=\.jpg)')
test_str = u"-abc1234567-abc.jpg"
subst = u""
result = re.sub(p, subst, test_str)
OR
You can use two capture groups as follows.
import re
p = re.compile(r'(\-abc)(\.jpg)')
test_str = u"-abc1234567-abc.jpg"
subst = r"\2"
result = re.sub(p, subst, test_str)

If you want to remove -abc only in .jpg file names, you could use:
re.sub(r"-abc\.jpg$", ".jpg", string)
To stay as close as possible to your code: put the parentheses around the parts you want to keep, not the part you want to remove, then use \g<NUMBER> in the replacement to refer to each kept part. So:
re.sub(r'(.*)-abc(\.jpg)$', r'\g<1>\g<2>', string)
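For the literal question, replacing only the text a capture group matched, one option is to pass a function to re.sub and splice the replacement in at the group's span. This is a minimal sketch; the helper name replace_group is hypothetical and not part of the re module.

```python
import re

def replace_group(pattern, repl, string, group=1):
    """Replace only the text matched by one capture group, keeping the rest of each match."""
    def _splice(match):
        whole = match.group(0)
        if match.start(group) == -1:  # the group did not take part in this match
            return whole
        # Convert the group's span to offsets relative to the start of the whole match.
        start = match.start(group) - match.start()
        end = match.end(group) - match.start()
        return whole[:start] + repl + whole[end:]
    return re.sub(pattern, _splice, string)

# The asker's pattern (with the dot escaped) now removes only '-abc'.
result = replace_group(r'\d(-abc)\.jpg$', '', '-abc1234567-abc.jpg')
print(result)  # -abc1234567.jpg
```

The digit and the extension stay in place because the function returns the whole match with only the group's slice swapped out.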

Related

Python Regex: Remove optional characters

I have a regex pattern with optional characters however at the output I want to remove those optional characters. Example:
string = 'a2017a12a'
pattern = re.compile("((20[0-9]{2})(.?)(0[1-9]|1[0-2]))")
result = pattern.search(string)
print(result)
I can have a match like this but what I want as an output is:
desired output = '201712'
Thank you.
You've already captured the intended data in groups and now you can use re.sub to replace the whole match with just contents of group1 and group2.
Try your modified Python code,
import re
string = 'a2017a12a'
pattern = re.compile(".*(20[0-9]{2}).?(0[1-9]|1[0-2]).*")
result = re.sub(pattern, r'\1\2', string)
print(result)
Notice that I've added .* on both sides of the pattern, so any extra characters around your data are matched and removed. I also removed the extra parentheses that were not needed. This also works with strings that have other digits around the date, such as hello123 a2017a12a some other 99 numbers
Output:
201712
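An alternative sketch that avoids the surrounding .*: search for the date and join the two groups directly. The function name extract_yyyymm is made up for illustration; the second sample string is the one mentioned above.

```python
import re

# Year (20xx), one optional separator character, then month (01-12).
DATE = re.compile(r"(20[0-9]{2}).?(0[1-9]|1[0-2])")

def extract_yyyymm(s):
    """Return year and month joined together, or None if no date is found."""
    m = DATE.search(s)
    if m is None:  # search() returns None when nothing matches
        return None
    return m.group(1) + m.group(2)

print(extract_yyyymm("a2017a12a"))                                 # 201712
print(extract_yyyymm("hello123 a2017a12a some other 99 numbers"))  # 201712
```

Unlike stripping every non-digit, this ignores unrelated digits such as 123 and 99 in the second string.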
You can just use re.sub with the pattern \D (any character that is not a digit):
>>> import re
>>> string = 'a2017a12a'
>>> re.sub(r'\D', '', string)
'201712'
Try this one:
import re
string = 'a2017a12a'
pattern = re.findall(r"(\d+)", string)  # capture each run of digits
print("".join(pattern))  # join the digit runs together
Output:
201712
If you want to remove all letters from the string, you can do this:
import re
string = 'a2017a12a'
re.sub('[A-Za-z]+','',string)
Output:
'201712'
You can use functions from the re module to get the required output, for example:
import re
# method 1
string = 'a2017a12a'
print(re.sub(r'\D', '', string))
# method 2
pattern = re.findall(r"(\d+)", string)
print("".join(pattern))
See the documentation below for more details.
https://docs.python.org/3/library/re.html

How can I "divide" words with regular expressions?

I have a sentence in which every token has a / in it. I want to just print what I have before the slash.
What I have now is basic:
text = less/RBR.....
return re.findall(r'\b(\S+)\b', text)
This obviously just prints the text. How do I keep only the part of each word before the /?
I assume you want all the characters before the slash from every word that contains a slash. For example, for the input string match/this but nothing here but another/one you would want the results match and another.
With regex:
import re
my_string = "match/this but nothing here but another/one"
result = re.findall(r"\b(\w*?)/\w*?\b", my_string)
print(result)  # ['match', 'another']
Without regex:
result = [word.split("/")[0] for word in my_string.split()]
print(result)
Simple and straight-forward:
rx = r'^[^/]+'
# anchor it to the beginning
# the class says: match everything not a forward slash as many times as possible
In Python this would be:
import re
text = "less/RBR....."
print(re.match(r'[^/]+', text))
re.match returns a match object, so call .group(0) to get the matched text:
print(re.match(r'[^/]+', text).group(0))
# less
This should also work
\b([^\s/]+)(?=/)\b
Python Code
import re
p = re.compile(r'\b([^\s/]+)(?=/)\b')
test_str = "less/RBR/...."
print(re.findall(p, test_str))  # ['less', 'RBR']
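For a whole sentence of word/TAG tokens, here is a short sketch of the lookahead idea next to a no-regex variant. The sample sentence is made up, and the rsplit variant assumes the tag always follows the last slash of a token.

```python
import re

text = "less/RBR than/IN 1/2/CD"  # made-up sample; the last token contains two slashes

# [^\s/]+ takes a run of non-space, non-slash characters; (?=/) requires a slash right after it.
print(re.findall(r'[^\s/]+(?=/)', text))
# ['less', 'than', '1', '2']

# Splitting each token once from the right keeps inner slashes in the word part.
words = [token.rsplit('/', 1)[0] for token in text.split()]
print(words)
# ['less', 'than', '1/2']
```

Which of the two results is wanted depends on whether a word itself may contain a slash.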

Python match and replace, what I do wrong?

I have a regular expression that matches some data (it is here), and now I am trying to replace all the matched data with a single : character
test_str = u"THERE IS MY DATA"
p = re.compile(ur'[a-z]+([\n].*?<\/div>[\n ]+<div class="large-3 small-3 columns">[\n ]+)[a-z]+', re.M|re.I|re.SE)
print re.sub(p, r':/1',test_str)
I tried it a few other ways, but either nothing is replaced, or the whole pattern is replaced instead of only the matched part
1) It's a backslash issue.
Use r':\1' as the replacement, not r':/1'.
2) You are replacing the whole match with :\1, which means all the matched text is replaced with : followed by the first group of the regex.
To keep the text around the first group, add two more groups, one before it and one after it, and put them back in the replacement.
I hope this fixes the issue:
import re
test_str = u"THERE IS MY DATA"
# re.S (DOTALL) is assumed here; the original re.SE is not a flag in the re module
p = re.compile(r'([a-z]+)([\n].*?<\/div>[\n ]+<div class="large-3 small-3 columns">[\n ]+)([a-z]+)', re.M|re.I|re.S)
print(re.sub(p, r'\1:\2\3', test_str))
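The difference between the replacement strings is easier to see on a small input. This sketch uses a made-up string and a much simpler pattern than the one above; only the grouping idea is the same.

```python
import re

text = "abc-123-def"  # hypothetical sample; the middle part plays the role of the group of interest
pattern = re.compile(r'([a-z]+)(-\d+-)([a-z]+)')

# The replacement stands in for the whole match, so anything not re-emitted is dropped.
only_middle = pattern.sub(r':\2', text)
print(only_middle)   # :-123-

# Re-emitting the outer groups keeps the surrounding text.
keep_all = pattern.sub(r'\1:\2\3', text)
print(keep_all)      # abc:-123-def

# Leaving \2 out replaces the middle part itself with ':'.
swap_middle = pattern.sub(r'\1:\3', text)
print(swap_middle)   # abc:def
```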

Python: Regex: Detecting hyphenated names and non-hyphenated names with one regex

I need to extract people's names from a really long string.
Their names are in this format: LAST, FIRST.
Some of these people have hyphenated names. Some don't.
My attempt with a smaller string:
Input:
import re
text = 'Smith-Jones, Robert&Epson, Robert'
pattern = r'[A-Za-z]+(-[A-Za-z]+)?,\sRobert'
print(re.findall(pattern, text))
Expected output:
['Smith-Jones, Robert', 'Epson, Robert']
Actual output:
['-Jones', '']
What am I doing wrong?
Use
import re
text = 'Smith-Jones, Robert&Epson, Robert'
pattern = r'[A-Za-z]+(?:-[A-Za-z]+)?,\sRobert'
print(re.findall(pattern, text))
# => ['Smith-Jones, Robert', 'Epson, Robert']
Just make the capturing group non-capturing. When the pattern contains capturing groups, findall returns the values of those groups instead of the whole match. So the simplest fix for this pattern is to replace (...)? with (?:...)?.
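If the capturing group has to stay for some other reason, a sketch of another route is re.finditer, which yields match objects so the whole match is still available through group(0):

```python
import re

text = 'Smith-Jones, Robert&Epson, Robert'
pattern = r'[A-Za-z]+(-[A-Za-z]+)?,\sRobert'  # the original pattern, capturing group kept

# findall returns only the group's text; a group that did not match gives ''.
groups_only = re.findall(pattern, text)
print(groups_only)
# ['-Jones', '']

# finditer yields match objects, and group(0) is always the whole match.
whole_matches = [m.group(0) for m in re.finditer(pattern, text)]
print(whole_matches)
# ['Smith-Jones, Robert', 'Epson, Robert']
```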

python regex sub without order

I have the following string "3 0ABC, mNone\n" and I want to remove m, None and \n. The catch is that m, \n and None can appear anywhere in the string, in any order. I would appreciate any help.
I can do re.sub('[\nm,]','',string) or re.sub('None','',string), but I don't know how to combine them, especially when the order doesn't matter.
If you want to remove m, None and \n, you can combine them as alternatives in one group. So you can use this regex:
(m|\n|None)
In a raw string, \n is passed to the regex engine unchanged, and the engine reads it as a newline.
If you use the following code:
import re
p = re.compile(r'(m|\n|None)')
test_str = u"3 0ABC, mNone\n"
subst = u""
result = re.sub(p, subst, test_str)
print(repr(result))
# Will show:
# '3 0ABC, '
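Because each alternative is tried at every position, the order in which the pieces appear in the input does not matter. A small sketch; the helper name strip_junk and the reordered sample strings are made up for illustration.

```python
import re

JUNK = re.compile(r'None|m|\n')

def strip_junk(s):
    """Remove every 'None', 'm' and newline, wherever they occur."""
    return JUNK.sub('', s)

for sample in ("3 0ABC, mNone\n", "None3 0\nABC, m", "\nm3 0ABC, None"):
    print(repr(strip_junk(sample)))  # '3 0ABC, ' each time
```

Note that this removes every m in the string, including one inside a longer word, which matches the request as stated.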
