regular expression, negative look-around for weird email matching - python

I'm trying to make an email matcher. Since there are so many things like this:
https://site_1.com#site_2.com/xxxxx
I decided to use a negative look-around to get rid of these. My attempt is as follows:
regex = r"([a-zA-Z0-9\._-]+(?!https?://.*)#[a-zA-Z0-9\._-]\.[a-zA-Z0-9])"
My idea is, the negative look-around will fail to match anything with a https://xxxxx#, but clearly I'm wrong. I did the following:
email_search = re.compile(regex)
email_search.search("https://siteA.com#siteB.com")
And the result is a match, the matched string is //siteA.com#siteB.com
I more or less have to use re.search because I'm working with obfuscated text, but as I understand it the negative look-ahead should do the trick. Please show me what I did wrong and how to do it correctly. Any help is appreciated!

Use negative look-aheads to prevent certain inputs from matching (i.e. "preconditions"):
regex = r"(?!https?://)<actual email regex here>"
You can chain them:
regex = r"(?!<exclude this>)(?!<exclude that>)(?!<and that>)<actual regex here>"
Apart from that - so, so, so many email matching regexes have been made by now that I would discourage you from inventing yet another one. Pick one from the pile.
The better ones would not allow things like https://site_1.com#site_2.com/xxxxx from the start, so you would not have to work around defects in your own creation.
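As a rough illustration of chaining such preconditions, the sketch below scans whitespace-separated tokens. The `OBFUSCATED_EMAIL` pattern and the sample text are simplified assumptions, not a production email regex. Because `re.search` and `re.findall` retry at every later position, a look-ahead only guards the position where a match starts, so the sketch also adds `(?<!\S)` (an addition that is not part of the answer above) to pin each candidate to the start of a token.

```python
import re

# Hypothetical, deliberately simplified "user#domain.tld" pattern (assumption).
OBFUSCATED_EMAIL = r"[^\s#]+#[^\s#]+\.[^\s#]+"

pattern = re.compile(
    r"(?<!\S)"          # candidate must start at the beginning of a token
    r"(?!https?://)"    # precondition 1: token must not start with http(s)://
    r"(?!ftp://)"       # precondition 2: token must not start with ftp://
    + OBFUSCATED_EMAIL
)

text = "see https://siteA.com#siteB.com or mail john.doe#example.com"
found = pattern.findall(text)
print(found)  # ['john.doe#example.com']
```

Without the look-behind, the engine would simply skip past `https://` and report a match starting later inside the URL, which is the behaviour described in the question.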

Related

Positive Lookbehind Stripping Out Metacharacters

I need to get the sequence at the end of many urls to label csv files. The approach I have taken gives me the result I want, but I am struggling to understand how I might use a positive lookbehind to capture all the characters after the word 'series' in the url while ignoring any metacharacters? I know I can use re.sub() to delete them, however, I am interested in learning how I can complete the whole process in one regex.
I have searched through many posts on how I might do this, and experimented with lots of different approaches but I haven't been able to figure it out. Mainly with replacing the .+ after the (?<=series\-) with something to negate that - but it hasn't worked.
url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'
res = re.search(r"(?<=series\-).+", url).group(0)
re.sub('-', '', res)
Which gives the desired result 'kbw10a'
Is it possible to strip out the metacharacter '-' in the positive lookbehind? Is there a better approach to this without the lookaround?
More examples;
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1014416/yanmar-marine-marine-main-engine-small-qm-series-kbw10',
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a',
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1018923/yanmar-marine-marine-main-engine-small-qm-series-kh18-a',
You cannot "ignore" chars in a lookaround the way you describe, because in order to match a part of a string, the regex engine needs to consume the part, from left to right, matching all subsequent subpatterns in your regex.
The only way to achieve that is through an additional step: removing the hyphens once the match is found. Note that you do not need another regex to remove hyphens, .replace('-', '') will suffice:
import re
url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'
resObj = re.search(r"series-(.+)", url)
if resObj:
    res = resObj.group(1).replace('-', '')
Note that it is much safer to first run re.search to get the match object and only then call .group(); otherwise, when there is no match, re.search returns None and calling .group() on it raises an AttributeError.
Also, there is no need of any lookarounds in the pattern, a capturing group will work as well.
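A minimal sketch of this two-step approach applied to the example URLs from the question. The helper name `series_label` is made up for illustration.

```python
import re

def series_label(url):
    """Return the text after 'series-' with hyphens removed, or None if absent."""
    match = re.search(r"series-(.+)", url)
    if match is None:
        return None
    return match.group(1).replace("-", "")

base = "https://yanmarshop.com/en-GB/catalog/all/browse/"
urls = [
    base + "yanmardata-1014416/yanmar-marine-marine-main-engine-small-qm-series-kbw10",
    base + "yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a",
    base + "yanmardata-1018923/yanmar-marine-marine-main-engine-small-qm-series-kh18-a",
]

labels = [series_label(u) for u in urls]
print(labels)  # ['kbw10', 'kbw10a', 'kh18a']
```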

Regex lookbehind and lookahead doesn't find any match

I have a lot of data that I need to parse and output in different format. The data looks something like this:
tag="001">utb20181009818<
tag="003">CZ PrNK<
...
And now, I want to extract 'utb20181009818' after 'tag="001">' and before the last '<'
This is my code in python:
regex_pattern = re.compile(r'''(?=(tag="001(.*?)">)).*?(?<=[<])''')
ID = regex_pattern.match(one_line)
print(ID)
My variable one_line already contains the necessary data and I just need to extract the value, but it doesn't seem to match no matter what I do. I looked at it for hours, but can't seem to find out what I'm doing wrong.
Try regex tag=\"001\">(.*?)< and capture the first group ID.group(1)
The issue is that lookaheads don't move the match position to the right because they don't consume any characters; they only look.
Obviously, utilizing a match group as suggested would be the simplest way to go here, as you wouldn't have to take pains to avoid matching the parts you don't want.
But if your "001" isn't variable length, I think what you want is actually a lookbehind/lookahead (not lookahead/lookbehind):
(?<=tag="001">).*(?=<)
https://regex101.com/r/rMQnna/3/
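Both suggestions can be tried side by side on the sample line from the question. This is only a sketch: it uses re.search rather than re.match on the assumption that real lines may not begin with `tag=`, since re.match only matches at the start of the string.

```python
import re

one_line = 'tag="001">utb20181009818<'  # sample line taken from the question

# Option 1: capture group; group(1) holds only the text between '>' and '<'.
m = re.search(r'tag="001">(.*?)<', one_line)
by_group = m.group(1) if m else None

# Option 2: lookbehind/lookahead; the whole match is already the value.
m = re.search(r'(?<=tag="001">).*(?=<)', one_line)
by_lookaround = m.group(0) if m else None

print(by_group, by_lookaround)  # utb20181009818 utb20181009818
```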

Improving accuracy/brevity of regex for inconsistent url filtering

So, for some lulz, a friend and I were playing with the idea of filtering a list (100k+) of urls to retrieve only the parent domain (ex. "domain.com|org|etc"). The only caveat is that they are not all nice and matching in format.
So, to explain, some may be "http://www.domain.com/urlstuff", some have country codes like "www.domain.co.uk/urlstuff", while others can be a bit more odd, more akin to "hello.in.con.sistent.urls.com/urlstuff".
So, story aside, I have a regex that works:
import re
firsturl = 'www.foobar.com/fizz/buzz'
m = re.search(r'\w+(?=(\..{3}/|\..{2}\..{2}/))\.(.{3}|.{2}\..{2})', firsturl)
m.group(0)
which returns:
foobar.com
It looks up the first "/" at the end of the url, then returns the two "." separated fields before it.
So, my query, would anyone in the stack hive mind have any wisdom to shed on how this could be done with better/shorter regex, or regex that doesn't rely on a forward lookup of the "/" within the string?
Appreciation for all of the help in this!
I do think that regex is just the right tool for this. Regex is pattern matching, which is put to best use when you have a known pattern that might have several variations, as in this case.
In your explanation of and attempted solution to the problem, I think you are greatly oversimplifying it, though. TLDs come in many more flavors than "2-letter country codes" and "3-letter" others. See ICANN's list of top-level domains for the hundreds currently available, with lengths from 2 letters and up. Also, you may have URLs without any slashes and some with multiple slashes and dots after the domain name.
So here's my solution (see on regex101):
^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})
What you want is captured in the first matching group.
Breakdown:
^(?:https?://)? matches a possible protocol at the beginning
(?:[^/]+\.)* matches possible multiple non-slash sequences, each followed by a dot
([^/]+\.[a-z]{2,}) matches (and captures) one final non-slash sequence followed by a dot and the TLD (2+ letters)
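A quick way to check the breakdown is to run the pattern over the sample URLs from the question. The names `DOMAIN_RE` and `parent_domain` are made up for this sketch.

```python
import re

DOMAIN_RE = re.compile(r"^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})")

def parent_domain(url):
    """Return the first capture group of DOMAIN_RE, or None if nothing matches."""
    m = DOMAIN_RE.match(url)
    return m.group(1) if m else None

samples = [
    "http://www.domain.com/urlstuff",
    "www.foobar.com/fizz/buzz",
    "hello.in.con.sistent.urls.com/urlstuff",
    "www.domain.co.uk/urlstuff",
]
domains = [parent_domain(u) for u in samples]
print(domains)  # ['domain.com', 'foobar.com', 'urls.com', 'co.uk']
```

Note the last result: for a two-part suffix such as `co.uk` the group captures only the suffix, so country-code domains of that shape would need extra handling, for example a lookup against a list of known suffixes.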
You can use this regex instead:
import re
firsturl = 'www.foobar.com/fizz/buzz'
domain = re.match(r"(.+?)/", firsturl).group(1)
Notice, though, that this will only work without 'http://', and that it returns the whole host (www.foobar.com), not just the parent domain.

Regular expression code is not working (Python)

Assume I have a word AB1234XZY or even 1AB1234XYZ.
I want to extract ONLY 'AB1234' or 1AB1234 (ie. everything up until the letters at the end).
I have used the following code to extract that but it's not working:
base= re.match(r"^(\D+)(\d+)", word).group(0)
When I print base, it's not working for the second case. Any ideas why?
Your regex doesn't work for the second case because that word starts with a digit; the \D+ at the beginning of your pattern requires at least one character that ISN'T a digit.
You should be able to use something quite simple for this--simpler, in fact, than anything else I see here.
'.*\d'
That's it! This should match everything up to and including the last number in your string, and ignore everything after that.
Here's the pattern working online, so you can see for yourself.
(.+?\d+)\w+ would give you what you want.
Or even something like this
^(.+?)[a-zA-Z]+$
re.match only matches at the beginning of the string, while re.search looks for the pattern anywhere in the string. Both return the first match. .group(0) is everything included in the match; if you have capturing groups, .group(1) is the first group, and so on. Unlike the usual convention where 0 is the first index, here 0 is a special case meaning the whole match.
In your case, depending on what you really need to capture, re.search may be the better choice. Instead of using 2 groups, you can use (\D+\d+). Keep in mind that it will capture only the first (non-digits, digits) run. That might be sufficient for you, but you might want to be more specific.
After reading your comment "everything before the letters at the end",
this regex is what you need:
regex = re.compile(r'(.+?)[A-Za-z]+$')
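Two of the patterns suggested above can be checked against both sample words from the question. This is a small sketch, and the variable names are made up for illustration.

```python
import re

words = ["AB1234XZY", "1AB1234XYZ"]

# Greedy '.*' runs to the end of the word, then backtracks to the last digit.
up_to_last_digit = [re.match(r".*\d", w).group(0) for w in words]

# Lazy '(.+?)' grows until only letters remain between it and the end.
before_trailing_letters = [
    re.match(r"^(.+?)[a-zA-Z]+$", w).group(1) for w in words
]

print(up_to_last_digit)         # ['AB1234', '1AB1234']
print(before_trailing_letters)  # ['AB1234', '1AB1234']
```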

Conditional Regular Expressions

I'm using Python and I want to use regular expressions to check if something "is part of an include list" but "is not part of an exclude list".
My include list is represented by a regex, for example:
And.*
Everything which starts with And.
Also the exclude list is represented by a regex, for example:
(?!Andrea)
Everything, but not the string Andrea. The exclude list is obviously a negation.
Using the two examples above, for example, I want to match everything which starts with And except for Andrea.
In the general case I have an includeRegEx and an excludeRegEx. I want to match everything which matches includeRegEx but does not match excludeRegEx. Attention: excludeRegEx is still in the negative form (as you can see in the example above), so it would be better to say: if something matches includeRegEx, I check if it also matches excludeRegEx, and if it does, the match is satisfied. Is it possible to represent this in a single regular expression?
I think Conditional Regular Expressions could be the solution but I'm not really sure of that.
I'd like to see a working example in Python.
Thank you very much.
Why not put both in one regex?
And(?!rea$).*
Since the lookahead only "looks ahead" without consuming any characters, this works just fine (well, this is the whole point of lookaround, actually).
So, in Python:
if re.match(r"And(?!rea$).*", subject):
    # Successful match.
    # Note that re.match always anchors the match
    # to the start of the string.
    print("match")
else:
    # Match attempt failed.
    print("no match")
From the wording of your question, I'm not sure if you're starting with two already finished lists of "match/don't match" pairs. In that case, you could simply combine them automatically by concatenating the regexes. This works just as well but is uglier:
(?!Andrea$)And.*
In general, then:
(?!excludeRegex$)includeRegex
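The general form can be sketched as a small helper. The name `include_but_not` is hypothetical, and unlike in the question the exclude pattern is assumed to be passed in its positive form (`Andrea`, not `(?!Andrea)`), with the helper adding the negative look-ahead itself.

```python
import re

def include_but_not(include, exclude):
    """Compile a pattern that matches `include` unless the whole string is `exclude`."""
    # Non-capturing groups keep alternations inside either pattern contained.
    return re.compile(rf"(?!(?:{exclude})$)(?:{include})")

matcher = include_but_not(r"And.*", r"Andrea")
names = ["Andrew", "Andrea", "Andreas", "Bob"]
results = {name: bool(matcher.match(name)) for name in names}
print(results)
# {'Andrew': True, 'Andrea': False, 'Andreas': True, 'Bob': False}
```

Because of the `$` inside the look-ahead, only the exact string `Andrea` is rejected; `Andreas` still matches.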
