Python Regex: Remove optional characters - python

I have a regex pattern with optional characters, but I want to remove those optional characters from the output. Example:
string = 'a2017a12a'
pattern = re.compile("((20[0-9]{2})(.?)(0[1-9]|1[0-2]))")
result = pattern.search(string)
print(result)
I can have a match like this but what I want as an output is:
desired output = '201712'
Thank you.

You've already captured the intended data in groups, so you can use re.sub to replace the whole match with just the contents of group 1 and group 2.
Try this modified version of your code:
import re
string = 'a2017a12a'
pattern = re.compile(".*(20[0-9]{2}).?(0[1-9]|1[0-2]).*")
result = re.sub(pattern, r'\1\2', string)
print(result)
Notice how I've added .* around the pattern, so any extra characters around your data are matched and removed. I also removed the extra parentheses that were not needed. This also works with strings that have other digits surrounding that text, such as: hello123 a2017a12a some other 99 numbers
Output:
201712
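If you would rather not rely on .* swallowing the rest of the string, a minimal sketch is to search for the match and join the two groups yourself. The helper name extract_year_month is hypothetical and not part of the original answer. It returns None when nothing matches, whereas re.sub would hand back the input unchanged.

```python
import re

# Same two groups as in the answer, without the surrounding .*
pattern = re.compile(r"(20[0-9]{2}).?(0[1-9]|1[0-2])")

def extract_year_month(text):
    """Return year and month joined together, or None if no match is found."""
    m = pattern.search(text)
    if m is None:
        return None
    return m.group(1) + m.group(2)

print(extract_year_month("a2017a12a"))
print(extract_year_month("hello123 a2017a12a some other 99 numbers"))
print(extract_year_month("no date here"))
```

Returning None makes the "no match" case explicit, so the caller can tell it apart from a successful extraction.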

You can just use re.sub with the pattern \D (= not a digit):
>>> import re
>>> string = 'a2017a12a'
>>> re.sub(r'\D', '', string)
'201712'
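One caveat worth checking, sketched under the assumption that the input may contain digits outside the date: \D strips every non-digit, so unrelated digits survive and end up glued to the result. The noisy sample string is the one mentioned in the first answer, and the variable names are illustrative only.

```python
import re

plain = 'a2017a12a'
noisy = 'hello123 a2017a12a some other 99 numbers'

# Removing every non-digit keeps ALL digits, not only the year and month
only_digits_plain = re.sub(r'\D', '', plain)
only_digits_noisy = re.sub(r'\D', '', noisy)

print(only_digits_plain)
print(only_digits_noisy)
```

For the plain input this gives the desired value. For the noisy input the digits from 123 and 99 are kept as well, so this approach fits only when the date digits are the only digits present.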

Try this one:
import re
string = 'a2017a12a'
pattern = re.findall(r"(\d+)", string)  # capture each run of digits
print("".join(pattern))  # combine all digits
Output:
201712

If you want to remove all letters from the string, you can do this:
import re
string = 'a2017a12a'
re.sub('[A-Za-z]+','',string)
Output:
'201712'

You can use the re module's functions to get the required output, for example:
import re
#method 1
string = 'a2017a12a'
print(re.sub(r'\D', '', string))
#method 2
pattern = re.findall(r"(\d+)", string)
print("".join(pattern))
You can also refer to the documentation below for more details.
https://docs.python.org/3/library/re.html

Related

Replace a substring between two substrings

How can I replace a substring between page1/ and _type-A with 222.6 in the below-provided l string?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.
You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any character (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position
You can use
import re
l = 'https://'+'homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
Note you used an empty string as the replacement argument. In the above snippet, the parts before and after .*? are captured and \g<1> refers to the first group value, and \2 refers to the second group value from the replacement pattern. The unambiguous backreference form (\g<X>) is used to avoid backreference issues since there is a digit right after the backreference.
Since the replacement string contains no backslashes, there is no need to preprocess (escape) anything in it.
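If the replacement text might contain backslashes or digits that look like group references, one way to sidestep escaping is to pass a function as the replacement. This is a sketch, not part of the original answer, and the helper name replace_between is hypothetical. re.sub inserts the function's return value literally.

```python
import re

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'

def replace_between(text, replacement):
    """Replace whatever sits between 'page1/' and '_type-A' with replacement."""
    # The lambda's return value is used as-is: no backslash or \1 processing.
    return re.sub(
        r'(page1/).*?(_type-A)',
        lambda m: m.group(1) + replacement + m.group(2),
        text,
        flags=re.DOTALL,
    )

print(replace_between(l, '222.6'))
print(replace_between(l, r'2\1'))  # the backslash stays literal
```

The second call shows that a value such as 2\1 is inserted unchanged instead of being read as a reference to group 1.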
This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses the positive lookbehind and lookahead assertions, (?<=...) and (?=...). A match only occurs if the text is preceded and followed by the assertions in the pattern, but the assertions themselves are not consumed. This means re.sub looks for text preceded by page1/ and followed by _type, but only replaces the part in between.

How to split string at any number followed by a period instead of a fixed delimiter

input:
string="1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering3.Harrison-Lu-Java-9223391910-HarrisonLu#gmail.com-Mumbai-INDIA-Samsung-Engineering4.Andrew-Joseph-Javascript-9200091910-AndrewJoseph#gmail.com-Toronto-CANADA-Dell-Engineering5.Larry-Ken-SQL-8880091910-LarryKen#gmail.com-Newyork-USA-HP-Management"
expected output:
[
"1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking",
"2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering",
...
]
Attempt: I have tried using a string.split(range(0,5)+"."). What would be the best way to do this?
I don't usually reach for regular expressions first, but this cries out for re.split.
parts = re.split(r'(\d\.)', string)
This does need a bit of post-processing. It creates:
['', '1.', 'Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking', '2.', 'Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering', ...
So you'll need to combine every other element.
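The post-processing step could be sketched as follows. The shortened sample string and the names parts and records are assumptions for illustration. The idea is to pair each captured delimiter with the text that follows it.

```python
import re

# Shortened, made-up sample in the same shape as the question's input
string = "1.Adam-Lee-Banking2.Peter-Smith-Engineering3.Harrison-Lu-Engineering"

parts = re.split(r'(\d\.)', string)
# parts[0] is the empty string before the first delimiter;
# after that, delimiters sit at odd indexes and record bodies at even indexes.
records = [num + body for num, body in zip(parts[1::2], parts[2::2])]

print(records)
```

Slicing with a step of 2 keeps the pairing explicit and avoids index arithmetic in a loop.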
You could split using a regex with lookaround assertions: (?=\d+\.) asserts 1+ digits followed by a dot to the right, and (?<!^) asserts that the position is not the start of the string:
(?<!^)(?=\d+\.)
import re
pattern = r"(?<!^)(?=\d+\.)"
string="1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering3.Harrison-Lu-Java-9223391910-HarrisonLu#gmail.com-Mumbai-INDIA-Samsung-Engineering4.Andrew-Joseph-Javascript-9200091910-AndrewJoseph#gmail.com-Toronto-CANADA-Dell-Engineering5.Larry-Ken-SQL-8880091910-LarryKen#gmail.com-Newyork-USA-HP-Management"
res = re.split(pattern, string)
print(res)
Output
[
'1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking',
'2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering',
'3.Harrison-Lu-Java-9223391910-HarrisonLu#gmail.com-Mumbai-INDIA-Samsung-Engineering',
'4.Andrew-Joseph-Javascript-9200091910-AndrewJoseph#gmail.com-Toronto-CANADA-Dell-Engineering',
'5.Larry-Ken-SQL-8880091910-LarryKen#gmail.com-Newyork-USA-HP-Management'
]
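One thing to watch with the lookahead split, shown here on made-up data: once the record numbers reach two digits, the position between the digits of "10." also satisfies (?=\d+\.), so the number gets split apart. A possible adjustment, assuming the preceding record never ends in a digit, is to add (?<!\d). Splitting on an empty match requires Python 3.7 or later. The sample string and variable names are hypothetical.

```python
import re

# Hypothetical data with two-digit record numbers
sample = "9.Ann-x10.Bob-y11.Cy-z"

# Original pattern: also splits between the digits of "10." and "11."
naive = re.split(r"(?<!^)(?=\d+\.)", sample)

# Adjusted pattern: do not split when the previous character is a digit
strict = re.split(r"(?<!^)(?<!\d)(?=\d+\.)", sample)

print(naive)
print(strict)
```

With the question's data (numbers 1 to 5) both patterns behave the same, so the adjustment matters only for longer lists.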
Or instead of splitting, you could also use a pattern to match 1 or more digits followed by a dot, and then match until the first occurrence of the same pattern or the end of the string.
\d+\..*?(?=\d+\.|$)
import re
pattern = r"\d+\..*?(?=\d+\.|$)"
string="1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering3.Harrison-Lu-Java-9223391910-HarrisonLu#gmail.com-Mumbai-INDIA-Samsung-Engineering4.Andrew-Joseph-Javascript-9200091910-AndrewJoseph#gmail.com-Toronto-CANADA-Dell-Engineering5.Larry-Ken-SQL-8880091910-LarryKen#gmail.com-Newyork-USA-HP-Management"
res = re.findall(pattern, string)
print(res)

python string change character in string before certain character

I have this URL and want to change the px value from 160 to 500. How can I do it without knowing the index of the character? I tried it with the replace function.
https://someurl.com//img_cache/381a58s7943437_037_160px.jpg?old
what I want:
https://someurl.com//img_cache/381a58s7943437_037_500px.jpg?old
The regex \d+(?=px) finds the digits that come directly before px, and re.sub replaces them with the value of new_res.
import re
string = "https://someurl.com//img_cache/381a58s7943437_037_160px.jpg?old"
new_res = "500"
out = re.sub(r"\d+(?=px)", new_res, string)
print(out)
Output:
https://someurl.com//img_cache/381a58s7943437_037_500px.jpg?old
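The same idea can be wrapped in a small helper so the new size is a parameter. This is a sketch only: the function name set_px is hypothetical, and the pattern assumes the size is always written as digits directly followed by "px.".

```python
import re

def set_px(url, new_px):
    """Replace the digits directly before 'px.' and leave the rest untouched."""
    # The lookahead checks for 'px.' without consuming it,
    # so only the digits themselves are replaced.
    return re.sub(r"\d+(?=px\.)", str(new_px), url)

url = "https://someurl.com//img_cache/381a58s7943437_037_160px.jpg?old"
print(set_px(url, 500))
```

Other digit runs in the URL are left alone because none of them is followed by "px.".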
You could use a regex pattern that looks for one or more digits, with a positive lookahead that asserts that what immediately follows is the substring "px.jpg":
import re
url = "https://someurl.com//img_cache/381a58s7943437_037_160px.jpg?old"
pattern = r"\d+(?=px\.jpg)"
print(re.sub(pattern, "540", url))
Output:
https://someurl.com//img_cache/381a58s7943437_037_540px.jpg?old
>>> import re
>>> s = 'https://someurl.com//img_cache/381a58s7943437_037_160px.jpg?old'
>>> s = re.sub(r'_\d+px\.', '_500px.', s)
>>> s
'https://someurl.com//img_cache/381a58s7943437_037_500px.jpg?old'

Retrieve regex full match

I'm new to regular expressions. I've read the documentation but I still have some questions.
I have the following string:
[('15000042', 19)]
And I need to get the key, the comma and the value as a string.
like this:
15000042,19
I need this to enter these values as a comma-separated value in a database.
I've tried the following regular expression:
([\w,]+)
but this only splits the string into 3 substrings. Is there a way to get the full match?
https://regex101.com/r/vtYKOG/1
I'm using python
You can match what you don't want to keep, use 3 capture groups instead of 1, and assemble your value from these 3 groups:
\[\('(\d+)'(,) (\d+)\)\]
For example:
import re
test_str = "[('15000042', 19)]"
result = re.sub(r"\[\('(\d+)'(,) (\d+)\)\]", r"\1\2\3", test_str)
if result:
    print(result)
Result
15000042,19
Another option is to negate your character class, [^\w,]+, so that it matches everything except what is listed.
Then replace those characters with an empty string:
import re
test_str = "[('15000042', 19)]"
result = re.sub(r"[^\w,]+", "", test_str)
if result:
    print(result)
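A further variation, offered as an assumption rather than something from the answers: collect the runs of word characters with re.findall and join them with a comma yourself. The names fields and csv_row are illustrative.

```python
import re

test_str = "[('15000042', 19)]"

# Each run of word characters is one field: the key and the value
fields = re.findall(r"\w+", test_str)
csv_row = ",".join(fields)

print(csv_row)
```

This keeps the separator under your control instead of depending on the comma already present in the input.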

python 3 regular expression match string meta-character

I want to write a regular expression that can match strings like "(2000)", with years in parentheses, so that I can check whether any string contains the substring "2000".
For example, I want the regex to match (2000), but not 2000, (20000), or (200).
That is to say: the match must have exactly four digits, the first digit must be 1 or 2, and the parentheses must be included.
Also, 2000 is just an example; I really want the regex to cover all possible years.
You have to escape the opening and closing parentheses:
>>> import re
>>> str = """foo(2000)bar(1000)foobar2000"""
>>> regex = r'\(2000\)'
>>> result = re.findall(regex, str)
>>> print(result)
['(2000)']
OR
>>> import re
>>> str = """foo(2000)bar(1000)foobar(2014)barfoo(2020)"""
>>> regex = r'\([0-9]{4}\)'
>>> result = re.findall(regex, str)
>>> print(result)
['(2000)', '(1000)', '(2014)', '(2020)']
It matches all the four-digit numbers (years) inside parentheses.
Special characters need to be escaped with a backslash. A parenthesis ( becomes \(. Therefore (2000) becomes \(2000\).
Then you can do something like:
import re
subject = "foo(2000)bar"  # example input, not from the question
if re.search(r"\(2000\)", subject):
    print("Successful match")
else:
    print("Match attempt failed")
>>> import re
>>> x = re.match(r'\((\d*?)\)', "(2000)")
>>> x.group(1)
'2000'
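To cover the full requirement from the question (exactly four digits, first digit 1 or 2, parentheses included), a pattern along these lines could be used. This is a sketch. The names year_pattern and sample are hypothetical, and the sample text is made up to exercise the cases listed in the question.

```python
import re

# \( and \) match literal parentheses; [12] is the first digit,
# \d{3} the remaining three, so only four-digit values fit between them.
year_pattern = re.compile(r"\([12]\d{3}\)")

sample = "foo(2000)bar 2000 (20000) (200) (1999) (3000)"
matches = year_pattern.findall(sample)

print(matches)
```

Because the closing parenthesis must come right after the fourth digit, (20000) and (200) are rejected, and 2000 without parentheses is ignored.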
