Regex to find text after linebreak in URL - python

I want to use a regex to extract part of a string: drop the Kerberos:// prefix and the host that follows it, and keep only the Username.
import re
text = 'Kerberos://DME.DMS.WORLD.DMSHEN/Username'
reg1 = re.compile(r"^((Kerberos?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$",text)
print(reg1)
Expected output:
Username
I am new to regex; I tried the pattern above, but it doesn't seem to work.

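Your regex itself works just fine, but re.compile takes flags as its second argument, not the text, so the text has to be passed to the match call instead. I am also assuming you would like to make most of the groups non-capturing (you can do that by adding ?: at the start of each group) and to name the group that holds the username.
It will give you the following:
re.match(r"^(?:(?:Kerberos?|ftp):\/)?\/?(?:[^:\/\s]+)(?:(\/\w+)*\/)(?P<u>[\w\-\.]+[^#?\s]+)(?:.*)?(?:#[\w\-]+)?$", text).group('u')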
Also, for future reference, try using https://regex101.com/ , it has an easy way to test your regex + explanations on each part.
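For what it's worth, the call in the question fails before any matching happens, because the second argument of re.compile is the flags value. A minimal sketch of the compile-then-match flow, assuming a deliberately simplified pattern (scheme, host, then a single path segment) and a hypothetical group name user, neither of which comes from the question:

```python
import re

text = 'Kerberos://DME.DMS.WORLD.DMSHEN/Username'

# Compile the pattern by itself; the text goes to match(), not to compile().
# Simplified pattern (an assumption): scheme://host/segment
pattern = re.compile(r"^\w+://[^/\s]+/(?P<user>[^/#?\s]+)$")

match = pattern.match(text)
# match is None when the text does not have the expected shape
username = match.group('user') if match else None
print(username)
```

Checking for None before calling group() avoids an AttributeError on input that does not fit the pattern.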

How about this simple one:
import re
text = 'Kerberos://DME.DMS.WORLD.DMSHEN/Username'
reg1 = re.findall(r"//.*/(.*)", text)
print(''.join(reg1))
# Username

If you want, you can use split instead of a regex:
text = 'Kerberos://DME.DMS.WORLD.DMSHEN/Username'
m = text.split('/')[-1]
print(m)
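If the input is always URL-shaped, the standard library's URL parser is another option. A small sketch, assuming the username is always the last path segment (the variable names are illustrative):

```python
from urllib.parse import urlsplit

text = 'Kerberos://DME.DMS.WORLD.DMSHEN/Username'

parts = urlsplit(text)
# parts.netloc holds the host part, parts.path holds '/Username'
username = parts.path.rsplit('/', 1)[-1]
print(username)
```

This keeps the host available in parts.netloc in case it is needed later.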

Related

How to use Regex to extract a string from a specific string until a specific symbol in python?

Question
Assume that I have a string like this:
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
Expectation
And I want to only extract the first url, which is
output = "https://www.example.com/link_1.html"
I think a regex that finds the URL starting at "https" and ending at the next '\' would be a good solution.
If so, how can I write the regex pattern?
I tried something like this:
re.findall("https://([^\\\\)]+)", example_text)
output = ['www.example.com/link_1.html', 'www.example.com/link_2.html']
But then, I need to add "https://" back and choose the first item in the return.
Is there any other solution?
You need to tweak your regex a bit.
What you were doing before:
https://([^\\\\)]+) this matches your link but only captures the part after https:// since you used the capturing token after that.
Updated Regex:
(https\:\/\/[^\\\\)]+) this matches the link and also captures the whole token (escaped special characters to avoid errors)
In Code:
import re
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
# Raw string: the class [^\\)] stops the match at a backslash or a closing parenthesis
print(re.findall(r"(https://[^\\)]+)", example_text))
Output:
['https://www.example.com/link_1.html', "https://www.example.com/link_2.html'"]
You could also use (https\:\/\/([^\\\\)]+\.html)) to get each link both with and without https:// as a tuple. (Ending the match at .html also avoids the trailing ' that you might get on some links.)
If you want only the first one, simply do output[0].
Try:
match = re.search(r"https://[^\\']+", example_text)
url = match.group()
print(url)
output:
https://www.example.com/link_1.html
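Since re.search returns None when nothing matches, calling group() directly can raise an AttributeError. A small sketch that wraps the same pattern in a helper (the function name first_url is hypothetical):

```python
import re

def first_url(raw):
    """Return the first https URL in raw, or None if there is none."""
    # Stop at a backslash or a single quote, as in the pattern above
    match = re.search(r"https://[^\\']+", raw)
    return match.group() if match else None

example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
print(first_url(example_text))
```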

Python: Extract values after decimal using regex

I am given a string that holds a number, for example "44.87" or "44.8796". I want to extract everything after the decimal point (.). I tried to use a regex in Python but was not successful. I am new to Python 3.
import re
s = "44.123"
re.findall(".","44.86")
Something like s.split('.')[1] should work
If you would like to use regex try:
import re
s = "44.123"
regex_pattern = r"(?<=\.).*"
matched_string = re.findall(regex_pattern, s)  # ['123']
(?<=...) is a positive lookbehind: the match must be preceded by the given character, which is not itself part of the result
\. is an escaped period
.* means "match everything after the period"
This online regex tool is a helpful way to test your regex as you build it. You can confirm this solution there! :)
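Note that s.split('.')[1] raises an IndexError when the string has no decimal point. A short sketch of two variants that return an empty string in that case instead (the function names are hypothetical):

```python
import re

def fraction_digits(s):
    """Return the text after the first '.', or '' if there is no '.'."""
    # partition splits on the first '.' and returns (head, separator, tail)
    return s.partition('.')[2]

def fraction_digits_re(s):
    """Same idea with a regex: the digits that directly follow a '.'."""
    m = re.search(r"(?<=\.)\d+", s)
    return m.group() if m else ''

print(fraction_digits("44.8796"), fraction_digits_re("44.8796"))
```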

Erase duplicate emails

I'm trying to use regex in scrapy to find all email addresses on a page.
I'm using this code:
item["email"] = re.findall('[\w\.-]+#[\w\.-]+', response.body)
This works almost perfectly: it grabs all the emails and gives them to me. However, I don't want any repeats in the result, even if the same email address appears more than once on the page.
I'm getting responses like this (which is correct):
{'email': ['billy666#stanford.edu',
'cantorfamilies#stanford.edu',
'cantorfamilies#stanford.edu',
'cantorfamilies#stanford.edu',
'footer-stanford-logo#2x.png']}
However I want to only show the unique addresses which would be
{'email': ['billy666#stanford.edu',
'cantorfamilies#stanford.edu',
'footer-stanford-logo#2x.png']}
If you want to throw in how to only collect the email and not that
'footer-stanford-logo#2x.png'
that is helpful also.
Thanks everyone!
Here is how you can get rid of the dupes and 'footer-stanford-logo#2x.png'-like thingies in your output:
import re
p = re.compile(r'[\w.-]+#(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')
test_str = "{'email': ['billy666#stanford.edu',\n 'cantorfamilies#stanford.edu',\n 'cantorfamilies#stanford.edu',\n 'cantorfamilies#stanford.edu',\n 'footer-stanford-logo#2x.png']}"
print(set(p.findall(test_str)))
The regex will look like
[\w.-]+#(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       ^^
The negative lookahead (?![\w.-]*\.(?:png|jpe?g|gif)\b) will disallow all matches with png, jpg, etc. extensions at the end of the word (\b is a word boundary, and in this case, it is a trailing word boundary).
Dupes can easily be removed with a set - it is the least troublesome part here.
FINAL SOLUTION:
item["email"] = set(re.findall(r'[\w.-]+#(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b', response.body))
Can't you just use a set instead of a list?
item["email"] = set(re.findall('[\w\.-]+#[\w\.-]+', response.body))
And if you really want a list then:
item["email"] = list(set(re.findall('[\w\.-]+#[\w\.-]+', response.body)))
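One caveat: a set does not keep the original order of the matches. If order matters, dict.fromkeys is a common alternative, sketched here on the sample output from the question (the variable names are illustrative):

```python
# Sample matches copied from the question, duplicates included
found = [
    'billy666#stanford.edu',
    'cantorfamilies#stanford.edu',
    'cantorfamilies#stanford.edu',
    'cantorfamilies#stanford.edu',
    'footer-stanford-logo#2x.png',
]

# dict keys are unique and keep insertion order (Python 3.7+)
unique = list(dict.fromkeys(found))
print(unique)
```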

Regex for uppercase and underscores between percentage signs

Regex has never been my strong point. In python I'm attempting to build an expression which matches substrings such as this:
%MATCH%
%MATCH_1%
$THIS_IS_A_MATCH%
It would be extracted by a %MATCH% like this or %LIKE_THIS%
I ended up with this (logically, but does not seem to work): %[A-Z0-9_]*$%
So where am I going wrong on this?
You can use a simple regex like this:
[%$]\w+[%$] <-- Notice I put $ because of your sample
On the other hand, if you only want uppercase you can use:
[%$][A-Z_\d]+[%$]
If you only want to match content within %, you could also use:
%.+?%
Python code
import re
p = re.compile(r'[%$]\w+[%$]')
test_str = "%MATCH%\n\n%MATCH_1%\n\n$THIS_IS_A_MATCH%"
print(p.findall(test_str))  # ['%MATCH%', '%MATCH_1%', '$THIS_IS_A_MATCH%']
By the way, the problem with your regex is this:
%[A-Z0-9_]*$%
           ^--- Remove this dollar sign
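To spell out why the original pattern fails: an unescaped $ asserts the end of the string, so the % after it can never match. A quick check on the sample sentence from the question, using + instead of * so that an empty %% is not matched (that choice is an assumption about what is wanted):

```python
import re

sentence = "It would be extracted by a %MATCH% like this or %LIKE_THIS%"

# Whole tokens, including the percent signs
tokens = re.findall(r"%[A-Z0-9_]+%", sentence)
# Only the names, via a capturing group
names = re.findall(r"%([A-Z0-9_]+)%", sentence)
# The original pattern finds nothing: $ is the end-of-string anchor
broken = re.findall(r"%[A-Z0-9_]*$%", sentence)

print(tokens, names, broken)
```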

How can I only take the regex without using re.split()?

I'm sorry for the title, I don't know how to ask this properly.
There's an HTML file I want to parse, using python2.7's re library. I know that I should use BeautifulSoup or stuff like that for parsing, but that doesn't really matter here.
There's a part in this HTML file: <br>(40%)</TD>. In this example, the number (here, it's 40) changes every now and then.
I want to use this number as a variable. To get that done, I'm using re.split() to split off everything preceding 40 and everything following 40 :
html = getHtmlTextFromWebsite()
very_important_number = int(re.split("%\)</TD>", (re.split("<br>\(", html))[1])[0])
I find this a very cumbersome way to do this and I'm wondering whether there are better ways.
If all you need is the number, you should use search (first match) or finditer (all matches) instead. Note that match only looks at the start of the string.
import re
p = re.compile(r"<br>\((\d+)%\)</TD>", re.IGNORECASE)
for m in p.finditer(html):
    # m.start(1) is the position of the number, m.group(1) the number itself
    print(m.start(1), m.group(1))
It will also give you the position of each number found.
This is not the regex you asked for, but if the string contains only that fragment, this strips everything before and after the number:
very_important_number = int(html.strip("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ%()<>br/"))
For a regex, you can use the re.search method.
Hope this helps!
What about
r = re.compile(r'<br>\((\d+)%\)</TD>', re.I)
m = r.search(text)
print(m.group(1))
Here you search for this exact pattern anywhere in the text and get it along with its "environment" (the match). From the match, you then extract group #1 (the part enclosed in ()).
You can also search without a compiled re object:
m = re.search(r'<br>\((\d+)%\)</TD>', text, re.I)
print(m.group(1))
You can try this:
import re
html = getHtmlTextFromWebsite()
number = re.findall(r'<br>\((\d+)%\)</TD>', html)
To print the number:
print(number[0])
which prints 40 with the example text.
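Putting the search-based answers together, here is a minimal sketch of a helper that returns the number as an int, or None when the fragment is missing. The names PERCENT_CELL and extract_percent and the sample markup are made up for illustration:

```python
import re

# Matches '<br>(N%)</TD>' and captures only the digits
PERCENT_CELL = re.compile(r"<br>\((\d+)%\)</TD>", re.IGNORECASE)

def extract_percent(html):
    """Return the number inside '<br>(N%)</TD>' as an int, or None."""
    m = PERCENT_CELL.search(html)
    return int(m.group(1)) if m else None

sample = "<TR><TD>Progress<br>(40%)</TD></TR>"  # made-up sample markup
print(extract_percent(sample))
```

Returning None instead of raising lets the caller decide what to do when the page layout changes.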
