Python Regex: Matching from end of string (reverse)

I want to match a string with the following criteria:
Match any letters, followed by a '.', followed by letters, followed by end-of-line.
For example, for the string 'www.stackoverflow.com', the regex should return 'stackoverflow.com'. I have the following code that works:
my_string = '''
123.domain.com
123.456.domain.com
domain.com
'''
>>> import re
>>> for i in my_string.split():
...     re.findall(r'[A-Za-z\.]*?([A-Za-z]+\.[a-z]+)$', i)
...
['domain.com']
['domain.com']
['domain.com']
>>>
The code snippet above works perfectly. But I'm sure there must be a more elegant way to achieve the same.
Is it possible to start the regex search/match starting from the end of the string, moving towards the start of the string? How would one code that type of regex? Or should I be using regex at all?

Your regex won't account for domains like domain.co.uk, so I would consider using something a little more robust. If you don't mind adding another dependency to your script, there's a module named tldextract (pip install tldextract) that makes this pretty simple:
import tldextract

def get_domain(url):
    result = tldextract.extract(url)
    # the public-suffix part is exposed as .suffix (older releases called it .tld)
    return result.domain + '.' + result.suffix

I'm not sure from your example if you're just trying to get the last two parts of the domain name, or if you're trying to remove the numbers. If you just want the last parts of the domain, you can do something like:
for i in my_string.split():
    print('.'.join(i.split('.')[-2:]))
This:
splits each string into a list of words, split where the '.' was originally, then
combines the final two words into a single string, with a '.' separator.
Or, like this:
>>> my_string = ['123.domain.com', '123.456.domain.com', 'domain.com', 'www.stackoverflow.com']
>>> ['.'.join(i.split('.')[-2:]) for i in my_string]
['domain.com', 'domain.com', 'domain.com', 'stackoverflow.com']
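
On the original question of matching from the end: Python's re module always scans left to right, but anchoring the pattern with $ has much the same effect, and str.rsplit does work from the right. A minimal sketch of both, assuming the goal is only the last two dot-separated labels (the names TAIL_RE, last_two_regex and last_two_rsplit are made up for illustration):

```python
import re

# Anchored at the end: letters, a dot, letters, then end of string.
TAIL_RE = re.compile(r'[A-Za-z]+\.[A-Za-z]+$')

def last_two_regex(host):
    """Return the trailing 'name.tld' part, or None if nothing matches."""
    m = TAIL_RE.search(host)
    return m.group(0) if m else None

def last_two_rsplit(host):
    """Split from the right at most twice and keep the last two labels."""
    return '.'.join(host.rsplit('.', 2)[-2:])

for host in ['123.domain.com', '123.456.domain.com', 'domain.com', 'www.stackoverflow.com']:
    print(last_two_regex(host), last_two_rsplit(host))
```

Neither version knows about multi-part suffixes such as co.uk, so this is only reasonable when the inputs are known to end in a single-label TLD.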

Related

Python: Extract values after decimal using regex

I am given a string that is a number, for example "44.87" or "44.8796". I want to extract everything after the decimal point (.). I tried to use regex in Python code but was not successful. I am new to Python 3.
import re
s = "44.123"
re.findall(".","44.86")

Something like s.split('.')[1] should work.

If you would like to use regex, try:
import re
s = "44.123"
regex_pattern = r"(?<=\.).*"
matched_string = re.findall(regex_pattern, s)  # ['123']
(?<=\.) is a positive lookbehind: the match must be preceded by a period, which is not itself part of the match
\. is an escaped period
.* means "match everything after the period"
This online regex tool is a helpful way to test your regex as you build it. You can confirm this solution there! :)
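
To compare the two suggestions side by side, here is a small sketch. It assumes the input contains at most one decimal point and that only the digits after it are wanted; the variable names are illustrative.

```python
import re

s = "44.8796"

# No regex: partition splits at the first '.', returning (before, sep, after).
whole, sep, frac = s.partition(".")

# Regex: a positive lookbehind for '.', then the digits that follow it.
m = re.search(r"(?<=\.)\d+", s)
frac_re = m.group(0) if m else ""

print(frac, frac_re)
```

Unlike s.split('.')[1], partition does not raise an IndexError when the string has no decimal point; frac is simply empty.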

Regex Puzzle: Match a pattern only if it is between two $$ without indefinite look behind

I am writing a snippet for the Vim plugin UltiSnips which will trigger on a regex pattern (as supported by Python 3). To avoid conflicts I want to make sure that my snippet only triggers when contained somewhere inside of $$___$$. Note that the trigger pattern might contain an indefinite string in front or behind it. So as an example I might want to match all "a" in "$$ccbbabbcc$$" but not "ccbbabbcc". Obviously this would be trivial if I could simply use indefinite look behind. Alas, I may not as this isn't .NET and vanilla Python will not allow it. Is there a standard way of implementing this kind of expression? Note that I will not be able to use any python functions. The expression must be a self-contained trigger.

If what you are looking for only occurs once between the '$$', then:
\$\$.*?(a)(?=.*?\$\$)
This allows you to match all 3 a characters in the example below. Breakdown:
\$\$ Matches '$$'
.*? Matches 0 or more characters non-greedily
(a) Captures the character you are looking for
(?=.*?\$\$) String must be followed by 0 or more arbitrary characters followed by '$$'
The code:
import re
s = "$$ccbbabbcc$$xxax$$bcaxay$$"
print(re.findall(r'\$\$.*?(a)(?=.*?\$\$)', s))
Prints:
['a', 'a', 'a']

The following should work:
re.findall(r"\${2}.+\${2}", stuff)
Breakdown:
Looks for two '$'
\${2}
Then looks for one or more of any character
.+
Then looks for two '$' again
\${2}

I believe this regex would work to match the a within the $$:
text = '$$ccbbabbcc$$ccbbabbcc'
re.findall(r'\${2}.*(a).*\${2}', text)
# prints
['a']
Alternatively:
A simple approach (requiring two checks instead of one regex) would be to first find all parts enclosed in your quoting text, then check if your search string is present within them.
Example:
text = '$$ccbbabbcc$$ccbbabbcc'
search_string = 'a'
parts = re.findall(r'\${2}.+\${2}', text)
[p for p in parts if search_string in p]
# prints
['$$ccbbabbcc$$']
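
When a string holds more than one $$...$$ pair, the greedy .+ above would swallow everything from the first $$ to the last. A hedged sketch of the same two-step idea with a non-greedy capture follows. It does not satisfy the question's single-expression constraint, and the sample text is made up for illustration.

```python
import re

text = "$$ccbbabbcc$$ xxax $$bcaxay$$"

# Step 1: non-greedy capture of each $$...$$ pair, so separate pairs stay separate.
inside = re.findall(r"\$\$(.*?)\$\$", text)

# Step 2: look for the trigger only within the captured contents.
hits = [m for part in inside for m in re.findall(r"a", part)]

print(inside)
print(hits)
```

The a in " xxax " is skipped because it sits between two pairs rather than inside one.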

How can I write a regular expression to replace hash-like strings

There are some Windows files and folders with names like:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157
c:\windows\system32\config\systemprofile\appdata\locallow\microsoft\cryptneturlcache\metadata\be7ffd2fd84d3b32fd43dc8f575a9f28
c:\windows\softwaredistribution\download\ab1b092b40dee3ba964e8305ecc7d0d9
Notice how they end with a string that looks like a hash:
57c8edb95df3f0ad4ee2dc2b8cfd4157, be7ffd2fd84d3b32fd43dc8f575a9f28,
ab1b092b40dee3ba964e8305ecc7d0d9
I am not good with regex and I would like to know if there is a way to write a regex that would replace these hash-like names within a path with something like
"##HASH##"
The paths do not necessarily end with these, as these are usually folders/subfolders containing other folders of their own.
So my goal is to essentially get a path looking like:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157\some_subfolder\some_file.inf
to become:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\##HASH##\some_subfolder\some_file.inf
Is there a way to do that in Python?
Thanks in advance.

If you look closely, the "hashes" are all 32 characters long. If that is true for all of them, the regex is pretty straightforward.
For example, with the last string you posted:
import re
text = r'c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157\some_subfolder\some_file.inf'
res = re.sub(r'\w{32}', '##HASH##', text)
print(res)
prints:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\##HASH##\some_subfolder\some_file.inf
Notice the r prefix on the path: in a raw string every \ is a literal backslash. Without it, Python would read \a as a bell character and \57 as an octal escape.
The \w{32} regex means "match any word character exactly 32 times".

This might help:
import re
# a whole path component made of exactly 32 hex characters
hash_like = re.compile(r'[0-9a-f]{32}\Z', re.I)
A = r"c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157\sub_folder"
# split on the literal backslash so this also runs on non-Windows systems
parts = A.split("\\")
path = "\\".join("##" + p + "##" if hash_like.match(p) else p for p in parts)
print(path)
Result:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\##57c8edb95df3f0ad4ee2dc2b8cfd4157##\sub_folder
Note: I am compiling for 32 chars in length. You can modify that value in re.compile. The path must be a raw string; otherwise \a becomes a bell character and \57 an octal escape, which corrupts the path before the regex ever sees it.
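
A slightly stricter variant of the re.sub approach, as a sketch: it assumes the hash-like names are exactly 32 hexadecimal digits and uses lookarounds so that a longer hex run is left alone. The names HASH_RE and mask_hashes are hypothetical.

```python
import re

# 32 hex digits that are not part of a longer run of hex digits.
HASH_RE = re.compile(r'(?<![0-9a-f])[0-9a-f]{32}(?![0-9a-f])', re.IGNORECASE)

def mask_hashes(path):
    """Replace every 32-hex-digit run in the path with the ##HASH## marker."""
    return HASH_RE.sub('##HASH##', path)

path = r'c:\windows\softwaredistribution\download\ab1b092b40dee3ba964e8305ecc7d0d9\some_subfolder\some_file.inf'
print(mask_hashes(path))
```

Compared with \w{32}, the character class [0-9a-f] avoids replacing an ordinary 32-character folder name that merely consists of letters, digits and underscores.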

python regex and replace

I am trying to learn python and regex at the same time and I am having some trouble in finding how to match till end of string and make a replacement on the fly.
So, I have a string like so:
ss="this_is_my_awesome_string/mysuperid=687y98jhAlsji"
What I'd want is to first find 687y98jhAlsji (I do not know this content beforehand) and then replace it with myreplacedstuff like so:
ss="this_is_my_awesome_string/mysuperid=myreplacedstuff"
Ideally, I'd want to do a regex and replace by first finding the contents after mysuperid= (till the end of string) and then perform a .replace or .sub if this makes sense.
I would appreciate any guidance on this.

You can try this:
re.sub(r'[^=]+$', 'myreplacedstuff', ss)
The idea is to use a character class that excludes the delimiter (here =) and to anchor the pattern with $.
Explanation:
[^=] is a character class and means all characters that are not =
[^=]+ one or more characters from this class
$ end of the string
Since the regex engine works from the left to the right, only characters that are not an = at the end of the string are matched.

You can use regular expressions:
>>> import re
>>> mymatch = re.search(r'mysuperid=(.*)', ss)
>>> ss.replace(mymatch.group(1), 'replacing_stuff')
'this_is_my_awesome_string/mysuperid=replacing_stuff'
You should probably use @Casimir's answer though. It looks cleaner, and I'm not that good at regex :p.
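
If the value itself might contain an = character, [^=]+$ would only replace the part after the last one. A sketch that keys on the mysuperid= prefix instead, with a regex-free alternative for comparison (variable names are illustrative):

```python
import re

ss = "this_is_my_awesome_string/mysuperid=687y98jhAlsji"

# Keep the 'mysuperid=' prefix (group 1) and replace the rest of the string.
replaced = re.sub(r'(mysuperid=).*$', r'\g<1>myreplacedstuff', ss)

# Without regex: rpartition splits at the last '='.
head, sep, _old = ss.rpartition('=')
replaced_plain = head + sep + 'myreplacedstuff'

print(replaced)
print(replaced_plain)
```

The \g<1> form of the backreference is used so the group number cannot run together with any digits that follow it in the replacement text.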

Python split by regular expression

In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile("^[a-zA-Z0-9_\.-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-\.]+$")
for bit in split:
    result = pattern.match(bit)
    if(result != None):
        emails.append(bit)
And this works, as long as there is a space in between the emails. But this might not always be the case. For example:
Hello, foo@foo.com
would return:
foo@foo.com
but, take the following string:
I know my best friend mailto:foo@foo.com!
This would return null. So the question is: how can I make it so that a regex is the delimiter to split? I would want to get
foo@foo.com
in all cases, regardless of punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.

I'd say you're looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:foo@foo.com!')
['foo@foo.com']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text foo@foo.com, text text, baz@baz.com!')
['foo@foo.com', 'baz@baz.com']

Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove/replace the anchors ^ and $ (for example with \b), e.g.:
r"\b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"

The problem I see in your regex is your use of ^, which matches the start of a string, and $, which matches the end of your string. If you remove them and then run it with your sample test case, it will work:
>>> re.findall(r"[A-Za-z0-9\._-]+@[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","I know my best friend mailto:foo@foo.com!")
['foo@foo.com']
>>> re.findall(r"[A-Za-z0-9\._-]+@[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","Hello, foo@foo.com")
['foo@foo.com']
>>>
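
Putting the suggestions from these answers together, a minimal sketch of an extractor built on the \b-delimited pattern. It uses @ as the separator, as real addresses do; the names EMAIL_RE and extract_emails are hypothetical, and the pattern is a rough approximation of an email address, not a validator.

```python
import re

# Word boundaries instead of ^ and $, and an escaped dot before the domain suffix.
EMAIL_RE = re.compile(r"\b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b")

def extract_emails(text):
    """Return every address-like substring, ignoring surrounding punctuation."""
    return EMAIL_RE.findall(text)

print(extract_emails("I know my best friend mailto:foo@foo.com!"))
print(extract_emails("Mail foo@foo.com. Or baz@baz.com!"))
```

The trailing \b is what keeps a sentence-ending period out of the match, since the last character class would otherwise accept it.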
