Regular expression to extract part of an email address (Python)

I am trying to use a regular expression to extract the part of an email address between the "#" sign and the "." character. This is how I am currently doing it, but I can't get the right result.
company = re.findall('^From:.+#(.*).',line)
Gives me:
['#iupui.edu']
I want to get rid of the .edu

To match a literal . in your regex, you need to use \., so your code should look like this:
company = re.findall(r'^From:.+#(.*)\.', line)
# the final dot is now escaped as \. (an unescaped . matches any character)
Note that this will always match up to the last occurrence of . in your string, because (.*) is greedy. If you want to stop at the first occurrence instead, you need to exclude . from your capturing group:
company = re.findall(r'^From:.+#([^.]*)\.', line)
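The difference between the greedy group and the negated character class only shows up when the host part contains more than one dot. A minimal sketch, assuming a made-up header line (the address below is hypothetical, not from the question):

```python
import re

# Hypothetical input with two dots after the '#'
line = "From: stephen#mail.iupui.edu"

# Greedy (.*) runs to the end of the line, then backtracks to the LAST dot
greedy = re.findall(r'^From:.+#(.*)\.', line)

# [^.]* cannot cross a dot, so the group ends at the FIRST dot
first = re.findall(r'^From:.+#([^.]*)\.', line)

print(greedy)  # ['mail.iupui']
print(first)   # ['mail']
```

For an address with a single dot after the "#", both patterns return the same result.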

You can try this:
(?<=\#)(.*?)(?=\.)

A simple example would be:
>>> import re
>>> re.findall(r".*(?<=\#)(.*?)(?=\.)", "From: atc#moo.com")
['moo']
>>> re.findall(r".*(?<=\#)(.*?)(?=\.)", "From: atc#moo-hihihi.com")
['moo-hihihi']
This matches the hostname regardless of what precedes it on the line. Because the leading .* is greedy, it picks the last # when the line contains several.
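The effect of the greedy .* prefix is easiest to see on a line with two "#" characters. This is only a sketch; the input string is invented for illustration:

```python
import re

# Hypothetical line containing two '#' characters
text = "From: a#b#moo.com"

# With the greedy prefix, the lookbehind ends up anchored at the LAST '#'
with_prefix = re.findall(r".*(?<=#)(.*?)(?=\.)", text)

# Without it, the scan stops at the FIRST position preceded by '#'
without_prefix = re.findall(r"(?<=#)(.*?)(?=\.)", text)

print(with_prefix)     # ['moo']
print(without_prefix)  # ['b#moo']
```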

You could just split and find:
s = " abc.def#ghi.mn I"
s = s.split("#", 1)[-1]
print(s[:s.find(".")])
Or use split alone if the string is not always going to contain a match:
s = s.split("#", 1)[-1].split(".", 1)[0]
If it always does, find will be the fastest:
i = s.find("#")
s = s[i+1:s.find(".", i)]
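One caveat with the find-based version: str.find returns -1 when the character is missing, and slicing with -1 silently drops the last character instead of failing. A small sketch of a guarded variant follows; the function name between_hash_and_dot is made up for this example:

```python
def between_hash_and_dot(s):
    """Return the text between the first '#' and the next '.', or None."""
    start = s.find("#")
    if start == -1:
        return None          # no '#' at all
    end = s.find(".", start)
    if end == -1:
        return None          # no '.' after the '#'
    return s[start + 1:end]


print(between_hash_and_dot(" abc.def#ghi.mn I"))  # ghi
print(between_hash_and_dot("no marker here"))     # None
```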

Related

Inconsistency between regex and python search

I'm writing a small regex that catches all the text before the numbers.
https://regex101.com/r/JhIiG9/2
import re
regex = "^(.*?)(\d*([-.]\d*)*)$"
message = "Myteeeeext 0.366- 0.3700"
result = re.search(regex, message)
print(result.group(1))
https://www.online-python.com/a7smOJHBwp
When I run this regex, instead of just showing the first group, which is Myteeeeext, I'm getting Myteeeeext 0.366-, but regex101 shows only Myteeeeext.
Try this regex: [^\d.-]+
It matches the first run of characters that are not digits, dots, or dashes, i.e. all the text before the numbers.
import re
regex = r"[^\d.-]+"
message = "Myteeeeext 0.366- 0.3700 notMyteeeeext"
result = re.search(regex, message)
print(f"'{result.group()}'")
Outputs:
'Myteeeeext '
Tell me if it's okay for you.
Your regex:
regex = "^(.*?)(\d*([-.]\d*)*)$"
doesn't allow for the numbers part to have any spaces, but your search string:
message = "Myteeeeext 0.366- 0.3700"
does have a space after the dash, so this part of your regex:
(.*?)
matches up to the second number.
It doesn't look like your test string in the regex101.com example you gave has a space, so that's why your results are different.
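The effect of the space can be checked directly by running the same pattern on both variants of the string. The version without a space is an assumption about what the regex101 test string looked like:

```python
import re

regex = r"^(.*?)(\d*([-.]\d*)*)$"

# String from the question (space after the dash)
with_space = re.search(regex, "Myteeeeext 0.366- 0.3700")
# Assumed regex101 variant (no space after the dash)
without_space = re.search(regex, "Myteeeeext 0.366-0.3700")

print(repr(with_space.group(1)))     # 'Myteeeeext 0.366- '
print(repr(without_space.group(1)))  # 'Myteeeeext '
```

In the first case the numeric group cannot cross the space, so the lazy first group has to grow until only 0.3700 is left for the second group.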

How can I find all paths in javascript file with regex in Python?

Sample Javascript (content):
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("sdfdsfsfds",'/test/path'),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
regex = ""
endpoints = re.findall(regex, content)
Output I want:
> /cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray=
> /test/path
I want to find all strings starting with "/ or '/ using a regex. I've tried many URL regexes, but none of them worked for me.
This should do it:
regex = r"""["']\/[^"']*"""
Note that you will need to trim the first character from the match. This also assumes that there are no quotation marks in the path.
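As a possible alternative to trimming, a capturing group around the path makes re.findall return the path without the opening quote. A short sketch on a shortened excerpt of the sample content:

```python
import re

# Shortened excerpt of the sample JavaScript from the question
content = '''r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray),r.setAttribute("sdfdsfsfds",'/test/path'),t.appendChild(r);'''

# The quote is matched but not captured; only the group is returned
endpoints = re.findall(r"""["'](/[^"']*)""", content)

for endpoint in endpoints:
    print(endpoint)
# /cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray=
# /test/path
```

The same assumption applies: a path that itself contains a quotation mark would be cut short.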
Consider:
import re
txt = ... #your code
pat = r"(\"|\')(\/.*?)\1"
for el in re.findall(pat, txt):
    print(el[1])
Each el is one match of the pattern: an opening single or double quote, then a / followed by as few characters as possible, then the same quote character that opened the match.
.* stands for any number of any characters, and the following ? makes it non-greedy, i.e. it matches as few characters as possible. \1 refers to the first group, so it matches whichever type of quote was matched at the beginning. Because the pattern has two groups, findall returns tuples, and el[1] is the second group, i.e. whatever was matched within the quotes.
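One consequence of the backreference is that a path may contain the other kind of quote. The sketch below uses an invented input string and re.finditer with group(2) instead of indexing into tuples:

```python
import re

# Hypothetical input: each path contains the opposite quote character
txt = '''a("/it's/here"); b('/say "hi"/path'); c("no/leading/slash")'''

pat = r"""(["'])(/.*?)\1"""

# group(2) is the text between the matching pair of quotes
paths = [m.group(2) for m in re.finditer(pat, txt)]
print(paths)  # ["/it's/here", '/say "hi"/path']
```

The third string is skipped because it does not start with a slash right after the quote.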

Python Regex to Extract Domain from Text

I have the following regex:
r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
When I apply this to a text string with, let's say,
"this is www.website1.com and this is website2.com", I get:
['www.website1.com']
['website.com']
How can I modify the regex to exclude the 'www', so that I get 'website1.com' and 'website2.com'? I'm missing something pretty basic ...
Try this one (thanks #SunDeep for the update):
\s(?:www\.)?(\w+\.com)
Explanation
\s matches any whitespace character
(?:www\.)? non-capturing group, matches www. zero or one time
(\w+\.com) matches one or more word characters, followed by .com
And in action:
import re
s = 'this is www.website1.com and this is website2.com'
matches = re.findall(r'\s(?:www\.)?(\w+\.com)', s)
print(matches)
Output:
['website1.com', 'website2.com']
A couple notes about this. First of all, matching all valid domain names is very difficult to do, so while I chose to use \w+ to capture for this example, I could have chosen something like: [a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}.
This answer has a lot of helpful info about matching domains:
What is a regular expression which will match a valid domain name without a subdomain?
Next, I only look for .com domains; you could adjust my regular expression to something like:
\s(?:www\.)?(\w+\.(com|org|net))
to match whichever types of domains you are looking for.
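One thing to watch when adding the (com|org|net) alternation: it introduces a second capturing group, and re.findall returns tuples as soon as a pattern has more than one group. A sketch with the dots escaped and a made-up .org address in the input:

```python
import re

# Hypothetical input mixing .com and .org
s = 'this is www.website1.com and this is website2.org'

# Two capturing groups: one tuple per match
as_tuples = re.findall(r'\s(?:www\.)?(\w+\.(com|org|net))', s)
print(as_tuples)  # [('website1.com', 'com'), ('website2.org', 'org')]

# Non-capturing alternation: one string per match
as_strings = re.findall(r'\s(?:www\.)?(\w+\.(?:com|org|net))', s)
print(as_strings)  # ['website1.com', 'website2.org']
```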
Here is an attempt:
import re
s = "www.website1.com"
k = re.findall ( '(www.)?(.*?)$', s, re.DOTALL)[0][1]
print(k)
Output:
website1.com
With s = "website1.com" the output is the same:
website1.com

Use python 3 regex to match a string in double quotes

I want to match a string contained in a pair of either single or double quotes. I wrote a regex pattern as so:
pattern = r"([\"\'])[^\1]*\1"
mytext = '"bbb"ccc"ddd'
re.match(pattern, mytext).group()
The expected output would be:
"bbb"
However, this is the output:
"bbb"ccc"
Can someone explain what's wrong with the pattern above? I googled and found the correct pattern to be:
pattern = r"([\"\'])[^\1]*?\1"
However, I don't understand why I must use ?.
In your regex
([\"'])[^\1]*\1
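A character class matches exactly one character, so your use of [^\1] is incorrect. Think about what would have to happen if the first capturing group held more than one character. In Python's re module, \1 inside a character class is not a backreference at all; it is an octal escape for the character with code 1. As a result [^\1]* matches every character, including quotes, and being greedy it runs to the last quote in the string, while the lazy *? stops at the first one.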
You can use negative lookahead like this
(["'])((?!\1).)*\1
or simply with alternation
(["'])(?:[^"'\\]+|\\.)*\1
or
(?<!\\)(["'])(?:[^"'\\]+|\\.)*\1
if you want to make sure that "b\"ccc" does not match in the string bb\"b\"ccc"
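The behaviour of [^\1] can be checked with a few calls. This is a small sketch using the string from the question; it relies on Python's re treating numeric escapes inside a character class as characters:

```python
import re

mytext = '"bbb"ccc"ddd'

# Inside [...], \1 means chr(1), so [^\1] happily matches a quote
quote_matches = re.fullmatch(r"[^\1]", '"') is not None

greedy = re.match(r"([\"'])[^\1]*\1", mytext).group()
lazy = re.match(r"([\"'])[^\1]*?\1", mytext).group()
lookahead = re.match(r"([\"'])(?:(?!\1).)*\1", mytext).group()

print(quote_matches)  # True
print(greedy)         # "bbb"ccc"
print(lazy)           # "bbb"
print(lookahead)      # "bbb"
```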
You should use a negative lookahead assertion. And I assume there won't be any escaped quotes in your input string.
>>> pattern = r"([\"'])(?:(?!\1).)*\1"
>>> mytext = '"bbb"ccc"ddd'
>>> re.search(pattern, mytext).group()
'"bbb"'
You can use:
pattern = r"[\"'][^\"']*[\"']"
https://regex101.com/r/dO0cA8/1
[^\"']* will match everything that isn't " or '
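Note that this last pattern does not require the closing quote to be the same kind as the opening one. A brief sketch of the difference from the backreference version, using an invented mismatched string:

```python
import re

simple = r"[\"'][^\"']*[\"']"
backref = r"([\"'])(?:(?!\1).)*\1"

# On the question's string the simple pattern gives the expected result
expected = re.search(simple, '"bbb"ccc"ddd').group()

# Hypothetical input that opens with " and closes with '
mismatched = "\"abc' rest"
simple_hit = re.search(simple, mismatched)
backref_hit = re.search(backref, mismatched)

print(expected)            # "bbb"
print(simple_hit.group())  # "abc'
print(backref_hit)         # None
```

Whether that matters depends on the input; if quotes are always paired correctly, the simpler pattern is enough.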

Split string via regular expression

Suppose I am given a string like:
input = """
abc#gmail.com is a very nice person
xyz#gmail.com sucks
lol#gmail.com is pretty funny."""
I have a regular expression for email addresses: ^[A-z0-9\+\.]+\#[A-z0-9\+\.]+\.[A-z0-9\+]+$
The goal is to split the string based on the email address regular expression.
The output should be:
["is a very nice person", "sucks", "is pretty funny."]
I have been trying to use re.split(EMAIL_REGEX, input), but I haven't been successful.
The output I get is a list containing the entire string.
Remove the ^ and $ anchors, as they only match the beginning and end of the string. Since the email addresses are in the middle of the string, they'll never match.
Your regexp has other problems. The account name can contain many other characters than the ones you allow, e.g. _ and -. The domain name can contain - characters, but not +. And you shouldn't use the range A-z to get upper and lower case characters, because there are characters between the two alphabetic blocks that you probably don't want to include (see the ASCII Table); either use A-Za-z or use a-z and add flags = re.IGNORECASE.
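The point about the A-z range can be verified by listing the ASCII characters that sit between Z and a and testing them against both classes. A minimal sketch:

```python
import re

# ASCII characters strictly between 'Z' and 'a'
extras = [chr(c) for c in range(ord('Z') + 1, ord('a'))]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

# [A-z] accepts all of them; [A-Za-z] accepts none
matched_by_A_z = [ch for ch in extras if re.fullmatch(r"[A-z]", ch)]
matched_by_explicit = [ch for ch in extras if re.fullmatch(r"[A-Za-z]", ch)]

print(matched_by_A_z == extras)  # True
print(matched_by_explicit)       # []
```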
The ^ and $ anchors might be throwing it off. With them, the pattern only matches a string that consists entirely of one email address.
I have something close to what you want:
>>> EMAIL_REGEX = r'[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}'
>>> re.split(EMAIL_REGEX, input, flags=re.IGNORECASE)
['\n', ' is a very nice person\n', ' sucks\n', ' is pretty funny.']
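To get from that result to the list asked for in the question, the pieces can be stripped and the empty ones dropped. A sketch under the same pattern; the variable is named text here only to avoid shadowing the built-in input:

```python
import re

text = """
abc#gmail.com is a very nice person
xyz#gmail.com sucks
lol#gmail.com is pretty funny."""

EMAIL_REGEX = r'[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}'

parts = re.split(EMAIL_REGEX, text, flags=re.IGNORECASE)
# Strip surrounding whitespace and discard pieces that become empty
messages = [p.strip() for p in parts if p.strip()]

print(messages)  # ['is a very nice person', 'sucks', 'is pretty funny.']
```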
You will probably need to loop through the lines and then split each with your regex.
Also your regex shouldn't have $ at the end.
Try something like:
import re
EMAIL_REGEX = r"\.[a-z]{3} "  # just for the demo; note the trailing space
ends = []
for L in input.split("\n"):
    parts = re.split(EMAIL_REGEX, L)
    if len(parts) > 1:
        ends.append(parts[1])
Output:
['is a very nice person', 'sucks', 'is pretty funny.']
I wouldn't use a regex here; this works as well:
messages = []
for item in input.split('\n'):
    # remove everything before the first space, which is just the email address in this case
    item = ' '.join(item.split(' ')[1:])
    messages.append(item)
Output of messages when using:
input = """
abc#gmail.com is a very nice person
xyz#gmail.com sucks
lol#gmail.com is pretty funny."""
['', 'is a very nice person', 'sucks', 'is pretty funny.']
If you want to remove the first element, just do it like this: messages = messages[1:]
