I have a URL: http://200.73.81.212/.CREDIT-UNION/update.php. None of the regular expressions I've found or developed myself works. I'm working on a phishing-mail dataset and there are lots of strange hyperlinks. This is one of mine:
https?:\/\/([a-zA-z0-9]+.)+)|(www.[a-zA-Z0-9]+.([a-zA-Z0-9]+\.[a-zA-Z0-9]+)+)(((/[\.A-Za-z0-9]+))+/?.
Of course, no success. I'm working in Python.
EDIT:
I need a regex to catch this kind of URL and also any ordinary hyperlink, like:
https://cnn.com/
www.foxnews.com/story/122345678
Any thoughts?
What about something like this?
import re
phish = re.compile('''(?P<http>http\://)
(?P<ipaddress>(([0-9]*(\.)?)[0-9]*)*)/\.
(?P<name>(\.)?([A-Za-z]*)(\-)?([A-Za-z]*))/
(?P<ending>(update\.php))''', re.VERBOSE)
example_string = 'http://200.73.81.212/.CREDIT-UNION/update.php'
found_matches = []
# check that matches actually exist in input string
if phish.search(example_string):
    # in case there are many matches, iterate over them
    for mtch in phish.finditer(example_string):
        # and append matches to master list
        found_matches.append(mtch.group(0))
print(found_matches)
# ['http://200.73.81.212/.CREDIT-UNION/update.php']
This is flexible enough that, in case you have endings other than update.php, you can simply include them in the named capture group, separating the alternative endings with |, i.e.
(update\.php|remove\.php, ...)
Furthermore, your ipaddress named capture group can take any number of octets, like 123.23.123.12; it doesn't have to be a fixed number of repeating number-and-period patterns. IP address octets cap out at 3 digits, so you could anchor those down with a curly-bracket quantifier to make sure you are matching the right kinds of numbers (note there must be no space inside the braces, or Python treats them as literal characters):
[0-9]{1,3}\. # minimum of 1 digit, maximum of 3
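As a quick standalone sketch (not part of the answer above), a dotted-quad matcher built from that quantifier could look like this:

```python
import re

# {1,3} (no space inside the braces) bounds each octet to 1-3 digits;
# with a space, Python's re treats "{1, 3}" as literal characters
ip_pattern = re.compile(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b')

print(ip_pattern.findall('http://200.73.81.212/.CREDIT-UNION/update.php'))
# ['200.73.81.212']
```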
While @datawrestler's answer works for the original question, I had to extend it to catch a wider group of URLs (I've edited the question). This regex seems to work for the task:
r"(https?://www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
r"|(https?://[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
r"|(www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
Three alternatives: https?://www.domain, https?://domain, www.domain
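As a quick check, the three-alternative pattern can be exercised on the examples from the question (written here as one concatenated raw string, so no backslash line continuations are needed):

```python
import re

url_re = re.compile(
    r"(https?://www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
    r"|(https?://[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
    r"|(www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
)

# the phishing URL and the two ordinary hyperlinks from the question
for s in ['http://200.73.81.212/.CREDIT-UNION/update.php',
          'https://cnn.com/',
          'www.foxnews.com/story/122345678']:
    print(url_re.search(s).group(0))
```

Note that a bare trailing slash (as in https://cnn.com/) is not included in the match, since each path segment requires at least one character after the /.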
I'm trying to search for URLs and want to exclude some. The variable download_artist stores the base URL, and I want to find additional links, but not uploads, favorites, followers or listens.
So I tried different versions with the mentioned words and a |, like:
urls = re.findall(rf'^{download_artist}uploads/|{download_artist}^favorites/|^{download_artist}followers/|^{download_artist}listens/|{download_artist}\S+"', response.text, re.IGNORECASE)
or:
urls = re.findall(rf'{download_artist}^uploads/|^favorites/|^followers/|^listens/|\S+"', response.text, re.IGNORECASE)
But it ignores my ^ for excluding the words. Where is my mistake?
You need to use a "lookaround" in this case; you can see more details at https://www.regular-expressions.info/lookaround.html.
So I think this regex solves your problem:
{download_artist}(?!uploads/|favorites/|followers/|listens/)\S+\"
You can test whether the regex works at https://regex101.com/. This site is very useful when you work with regex.
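A minimal sketch of the lookahead in action. The base URL and HTML snippet here are made-up illustrations, not from the question:

```python
import re

# hypothetical base URL and HTML snippet, just to illustrate the lookahead
download_artist = 'https://example.com/artist/'
html = ('<a href="https://example.com/artist/uploads/x.mp3"> '
        '<a href="https://example.com/artist/song1">')

# (?!...) is a negative lookahead: the match fails when any of the listed
# words follow the base URL; re.escape keeps '.' literal
pattern = rf'{re.escape(download_artist)}(?!uploads/|favorites/|followers/|listens/)\S+"'

print(re.findall(pattern, html))
# ['https://example.com/artist/song1"']
```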
^ only works as a negation in character classes inside [], outside it represents the beginning of the input.
I suggest you do two matches: One to match all urls and another one to match the ones to exclude. Then remove the second set of urls from the first one.
That will keep the regexes simple and readable.
If you have to do it in one regex for whatever reason you can try to solve it with (negative) lookaround pattern (see https://www.rexegg.com/regex-lookarounds.html).
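A minimal sketch of the two-pass idea, with a made-up base URL and snippet (all names here are assumptions, not from the question):

```python
import re

base = 'https://example.com/artist/'
html = ('href="https://example.com/artist/uploads/a" '
        'href="https://example.com/artist/track1" '
        'href="https://example.com/artist/followers/" ')

# pass 1: every link under the base URL (up to the closing quote)
all_urls = set(re.findall(rf'{re.escape(base)}\S+(?=")', html))
# pass 2: the links we want to exclude
excluded = set(re.findall(rf'{re.escape(base)}(?:uploads|favorites|followers|listens)/\S*(?=")', html))

# simple set difference keeps both regexes readable
wanted = all_urls - excluded
print(sorted(wanted))
```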
I'm trying to parse YouTube descriptions of songs to compile into a .csv.
Currently I can isolate timecodes, though my attempt at isolating the song and artist is proving trickier.
First, I catch the whitespace
# catches whitespace
pattern = re.compile(r'\s+')
Second, the timecodes (to make the string simpler to deal with)
# catches timecodes
pattern1 = re.compile(r'[\d\.-]+:[\d.-]+:[\d\.-]+')
then I sub and remove.
I then try to capture all strings between \n, as this is how the tracklist is formatted
songBeforeDash = re.search(r'^([\\n][a-zA-Z0-9]*-[a-zA-Z0-9]*[\\n]*)+$', description)
The format follows \n[string]-[string]\n
Using this excellent visualiser, I've been able to tweak it so it catches the first result; however, any subsequent results don't match.
Is this a case of stopping at the first result and not catching the others?
Here's a sample of what I'm trying to catch
\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\nYasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\nsadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n
You can do that with split()
t = '\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\nYasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\nsadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n'
liste = t.split('\n')
liste = liste[1:-1]  # drop the empty first and last entries
print(liste)
re.search only returns the first match in the string.
What you want is to use re.findall which returns all matches.
EDIT - Because consecutive entries share their delimiting newline, your matches would overlap, and the regex engine cannot return overlapping matches. I would suggest editing the regex so each match stops before the next newline. Consider changing the regex to this:
r'^([\\n][a-zA-Z0-9]*-[a-zA-Z0-9]*)+$'
If what you want is for them to overlap (meaning you want to capture the newlines too), I suggest looking here to see how to capture overlapping regex patterns.
Also, as suggested by @ChatterOne, using the str.split(separator) method would work well here, assuming no other type of information is present.
description.split('\n')
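If you do want a regex rather than split(), here is a sketch that pulls artist/title pairs straight out of the sample string, assuming every entry is newline-delimited and dash-separated (the character class is an assumption based on the sample):

```python
import re

t = '\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\nYasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\nsadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n'

# each entry sits after a newline; two groups capture artist and title
tracks = re.findall(r'\n([A-Za-z0-9.]+)-([A-Za-z0-9.]+)', t)

print(tracks[0])   # ('middleschoolxAso', 'Cypress')
print(len(tracks)) # 9
```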
I've got a Python script that loops through a list of websites/domains to scrape phones and e-mails from my clients' websites. 99% of the website scrapes are OK and work, but some websites just hang and I can't even force-break the operation, like it's in an infinite loop. Below is an example. Could anyone help me improve or fix this?
import requests,re
try:
    r = requests.Session()
    f = r.get('http://www.poffoconsultoria.com.br', verify=False, allow_redirects=False, timeout=(5, 5))
    s = f.text
    tels = set(re.findall(r"\s?\(?0?[1-9][1-9]\)?[-\.\s][2-5]\d{3}\.?-?\s?\d{4}", s))
    emails = set(re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s))
    print(tels)
    print(emails)
except Exception as e:
    print(e)
You should remove the \s? from the first regex (you do not really need a whitespace at the start of the match), or replace with (?<!\S) if you want to only match after a whitespace or start of string.
The real problem is with the second regex, where . resides in a character class that is quantified with +. The \. that follows also matches a ., and that becomes a problem when no matching text appears in the string: this is catastrophic backtracking.
Since the matches you expect are whole words, I suggest enhancing the pattern by 1) adding word boundaries, 2) making all adjoining subpatterns match different types of chars.
Use
r'\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}\b'
to match emails.
See the (?:[A-Za-z0-9-]+\.)+ part: it matches one or more repetitions of 1 or more alphanumeric/hyphen chars followed with a dot, and there is no \. after this pattern, there is an alpha character class, so there should be no problem like the one present before.
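A quick sketch of the fixed pattern on a small string (the sample text is made up):

```python
import re

# word boundaries plus a domain part where adjacent subpatterns cannot
# match the same characters, so no catastrophic backtracking
safe_email = re.compile(r'\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}\b')

s = 'Contato: joao.silva@example.com.br ou (47) 3333-4444'
print(safe_email.findall(s))
# ['joao.silva@example.com.br']
```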
So. I got the website data fine in Python 2.7 using
>>> string = requests.get('http://www.poffoconsultoria.com.br').text
I then took the length of the string:
>>> len(string)
474038
That's a really high value.
So for problems like these, when you see a regex take such a long time (really, right after checking the length of the page), you should visit the page in your browser and inspect the page source.
When I inspected the page in my browser, I found huge inline data dictionaries in the page source.
The second regex's [A-Za-z0-9._%+-]+ will definitely hang (really, take a very long time) because it is unbounded and has to search through those ginormous portions.
You either need to chunk the page or limit your regex. Or you could write a function that discards dictionary data if you suspect that what you need to return won't appear inside it; basically, though, those huge dictionaries are what causes the regex you posted to take so long.
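One way to sketch the chunking idea: run the regex line by line, so a pathological pattern can only backtrack within one line at a time, and skip lines that obviously cannot contain a match. The sample page text below is an illustration, not the real site:

```python
import re

risky = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}')

# illustrative page: one huge data line, one line with an email
page = 'var data={"huge":"blob"};\ncontact@example.com\nmore text\n'

emails = set()
for line in page.splitlines():
    # cheap pre-filter: skip lines that cannot contain an email at all
    if '@' not in line:
        continue
    emails.update(risky.findall(line))

print(emails)
# {'contact@example.com'}
```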
Use a valid email regex:
(?i)(?:("[^"\\]*(?:\\.[^"\\]*)*"@)|((?:[0-9a-z](?:\.(?!\.)|[-!#$%&'*+/=?^`{}|~\w])*)?[0-9a-z]@))(?:(\[(?:\d{1,3}\.){3}\d{1,3}\])|((?:[0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][-a-z0-9]{0,22}[a-z0-9]))
So, for some lulz, a friend and I were playing with the idea of filtering a list (100k+) of urls to retrieve only the parent domain (ex. "domain.com|org|etc"). The only caveat is that they are not all nice and matching in format.
So, to explain, some may be "http://www.domain.com/urlstuff", some have country codes like "www.domain.co.uk/urlstuff", while others can be a bit more odd, more akin to "hello.in.con.sistent.urls.com/urlstuff".
So, story aside, I have a regex that works:
import re
firsturl = 'www.foobar.com/fizz/buzz'
m = re.search('\w+(?=(\..{3}/|\..{2}\..{2}/))\.(.{3}|.{2}\..{2})', firsturl)
m.group(0)
which returns:
foobar.com
It looks up the first "/" at the end of the url, then returns the two "." separated fields before it.
So, my query, would anyone in the stack hive mind have any wisdom to shed on how this could be done with better/shorter regex, or regex that doesn't rely on a forward lookup of the "/" within the string?
Appreciation for all of the help in this!
I do think that regex is just the right tool for this. Regex is pattern matching, which is put to best use when you have a known pattern that might have several variations, as in this case.
In your explanation of and attempted solution to the problem, I think you are greatly oversimplifying it, though. TLDs come in many more flavors than 2-letter country codes and 3-letter generic ones. See ICANN's list of top-level domains for the hundreds currently available, with lengths from 2 letters and up. Also, you may have URLs without any slashes and some with multiple slashes and dots after the domain name.
So here's my solution (see on regex101):
^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})
What you want is captured in the first matching group.
Breakdown:
^(?:https?://)? matches a possible protocol at the beginning
(?:[^/]+\.)* matches possible multiple non-slash sequences, each followed by a dot
([^/]+\.[a-z]{2,}) matches (and captures) one final non-slash sequence followed by a dot and the TLD (2+ letters)
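A small sketch exercising that pattern on examples of the kind given in the question:

```python
import re

domain_re = re.compile(r'^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})')

for url in ['http://www.domain.com/urlstuff',
            'hello.in.con.sistent.urls.com/urlstuff']:
    print(domain_re.match(url).group(1))
# domain.com
# urls.com
```

Note that for country-code suffixes like .co.uk, the last two dot-separated fields are what gets captured, since the regex has no TLD list to consult.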
You can use this regex instead:
import re
firsturl = 'www.foobar.com/fizz/buzz'
domain = re.match(r"(.+?)/", firsturl).group(1)
Notice, though, that this only works without 'http://' (and it keeps the leading 'www.'); group(1) is needed because group() alone would include the trailing slash.
I have a regex to find an email, mailto:(.*)\"|(\S+@+\S*)|(.{1,40}@.{1,40}), on several HTML sources.
Given the string Email: <u><a href="mailto:test@test.com">email me, the piece mailto:(.*)\" works great.
I would want it to stop there and return the value instead of continuing with the other expressions. Is there something like an XOR operator, or another way to do this, since I will have to add more pieces eventually?
I tried here: http://pythex.org/
Regex should do this naturally. To illustrate this point, one of the easy ways of matching a word with one exception is to precede your match with the exception and use an alternation.
For instance, in my email program I need to sort all emails with the subject line /labels?/ to another folder. However, some of my contacts never learned how to spell (apparently), so I also sort /lables?/. I then noticed that emails containing the word Available were also being picked up by this filter.
I could have done /\blables?/ but preferred instead to catch available and handle it separately, so I did:
`/available|(lables?)|(labels?)/`
This alternation will match available, but only match and capture lable, lables, label, or labels.
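Translated to Python, the same trick looks like this (the sample subject line is made up): matches of the exception alternative leave the capture groups empty, so you can filter them out afterwards:

```python
import re

# 'available' is matched first but captured by no group
pattern = re.compile(r'available|(lables?)|(labels?)', re.IGNORECASE)

subject = 'Available now: labels and lables for everyone'
hits = [m.group(1) or m.group(2) for m in pattern.finditer(subject)]

print(hits)                    # [None, 'labels', 'lables']
print([h for h in hits if h])  # ['labels', 'lables']
```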
As Adam Smith said, the problem you're having is that your last alternation, .{1,40}@.{1,40}, begins matching well before the rest of your alternations, so it consumes the text and that's the match that is returned.
To overcome this, you could modify the other partial patterns so that they match just as early, by prepending .*?, e.g. .*?mailto:(.*)\"|.*?(\S+@+\S*)|(.{1,40}@.{1,40}).
Or, perhaps somewhat less convoluted, you could just search for one after the other:
string = 'Email: <u><a href="mailto:test@test.com">email me'
m = re.search('mailto:(.*)\"', string) \
    or re.search('(\S+@+\S*)', string) \
    or re.search('(.{1,40}@.{1,40})', string)
print(m.group(1))