how to regex this link? - python

I want to regex a list of URLs.
The links format looks like this:
`https://en.wikipedia.org/wiki/Alexander_Pushkin`
The part I need:
en.wikipedia.org
Can you help, please?

Instead of looking for \w etc. which would only match the domain, you're effectively looking for anything up to where the URL arguments start (the first ?):
re.search(r'[^?]*', URL)
This means: starting at the beginning of the string, match all characters that are not ?. A character class beginning with ^ negates the class, i.e. it matches any character that is not listed.
This gives you a match object, where [0] will be the URL you're looking for.
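A minimal sketch of that call, assuming a made-up example.com URL (it is not from the question). Since [^?]* can also match the empty string, re.search never returns None for this pattern:

```python
import re

# Hypothetical URLs, used only for illustration.
with_query = "https://example.com/path/page?x=1&y=2"
without_query = "https://example.com/path/page"

# [^?]* consumes characters until the first "?" or the end of the string.
base = re.search(r"[^?]*", with_query)[0]
unchanged = re.search(r"[^?]*", without_query)[0]

print(base)
print(unchanged)
```

Both calls print https://example.com/path/page, so a URL without a query string passes through untouched.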

You can do that without using regex by leveraging urllib.parse.urlparse:
from urllib.parse import urlparse
url = "https://sales-office.ae/axcapital/damaclagoons/?cm_id=14981686043_130222322842_553881409427_kwd-1434230410787_m__g_&gclid=Cj0KCQiAxc6PBhCEARIsAH8Hff2k3IHDPpViVTzUfxx4NRD-fSsfWkCDT-ywLPY2C6OrdTP36x431QsaAt2dEALw_wcB"
parsed_url = urlparse(url)
print(f"{parsed_url.scheme}://{parsed_url.netloc}{parsed_url.path}")
Outputs
https://sales-office.ae/axcapital/damaclagoons/
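For the URL in the question itself, where only the host part is wanted, the same parser seems like the simplest route. A small sketch using the Wikipedia link from the question:

```python
from urllib.parse import urlparse

url = "https://en.wikipedia.org/wiki/Alexander_Pushkin"
parsed = urlparse(url)

# netloc is the "host[:port]" part between the scheme and the path.
host = parsed.netloc
print(host)
```

This prints en.wikipedia.org. If a URL may carry a port or user info, parsed.hostname returns just the lowercased host name.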

Based on your example, this looks like it would work:
\w+://\S+\.\w+\/\S+\/

Based on: How to match "anything up until this sequence of characters" in a regular expression?
.+?(?=\?)
so:
re.findall(r".+?(?=\?)", URL)
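One thing to watch with the lookahead: it requires a ? to be present, so a URL without a query string yields no match at all. A hedged sketch with a fallback; the helper name before_query and the example.com URLs are made up for illustration:

```python
import re

def before_query(url):
    """Return the part of url before the first '?', or url itself if there is none."""
    m = re.search(r".+?(?=\?)", url)
    return m.group(0) if m else url

# Hypothetical URLs, used only for illustration.
with_query = "https://example.com/a/b?x=1"
without_query = "https://example.com/a/b"

print(re.findall(r".+?(?=\?)", with_query))     # one-element list
print(re.findall(r".+?(?=\?)", without_query))  # empty list: no "?" to look ahead to
print(before_query(without_query))              # falls back to the full URL
```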

Related

How to use Regex to extract a string from a specific string until a specific symbol in python?

Question
Assume that I have a string like this:
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
Expectation
And I want to only extract the first url, which is
output = "https://www.example.com/link_1.html"
I think using regex to find the url start from "https" and end up '\' will be a good solution.
If so, how can I write the regex pattern?
I try something like this:
re.findall("https://([^\\\\)]+)", example_text)
output = ['www.example.com/link_1.html', 'www.example.com/link_2.html']
But then, I need to add "https://" back and choose the first item in the return.
Is there any other solution?
You need to tweak your regex a bit.
What you were doing before:
https://([^\\\\)]+) this matches your link but only captures the part after https:// since you used the capturing token after that.
Updated Regex:
(https\:\/\/[^\\\\)]+) this matches the link and also captures the whole token (escaped special characters to avoid errors)
In Code:
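import re
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
# Raw string: [^\\)] excludes a literal backslash and ")"; ":" and "/" need no escaping in Python.
print(re.findall(r"(https://[^\\)]+)", example_text))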
Output:
['https://www.example.com/link_1.html', "https://www.example.com/link_2.html'"]
You could also use r"(https://([^\\)]+)\.html)" to get each link as a tuple: the full link with https://, and the part after https:// (without the .html suffix). Because the match has to end in .html, this also avoids the trailing ' that you might get on some links.
If you want only the first one, simply do output[0].
Try:
match = re.search(r"https://[^\\']+", example_text)
url = match.group()
print(url)
output:
https://www.example.com/link_1.html

Regex to capture url until a certain character

With a url such as
https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&
I am using
pat = re.compile('<a href="(https?://.*?)".*',re.DOTALL)
as a search pattern.
I want to pick any url like the yahoo url above, but I want to capture the url up to the literal ? in the actual url.
In other words, I want to extract the URL up to the ?, knowing that not all the URLs I'm parsing have the ? character. In that case I need to capture the whole URL.
The above regex works and extracts the url but goes to the end of the url. How can I get it to stop at the first ? it encounters, and keep going to the end if it doesn't encounter a ?
Regex is really the wrong tool for the job. Doing a basic string split will get you exactly what you want.
def beforeQuestionMrk(inputStr):
    return inputStr.split("?")[0]

url = "https://search.yahoo.com/sometext"
url2 = "https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&"
print(beforeQuestionMrk(url))
print(beforeQuestionMrk(url2))
#https://search.yahoo.com/sometext
#https://search.yahoo.com/search
If you really wanted to use regex, I suppose you could do the following:
import re

def getBeforeQuestRegex(inputStr):
    # [^?]+ stops before the first "?" and runs to the end if there is none
    return re.search(r"[^?]+", inputStr).group(0)

print(getBeforeQuestRegex("https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&"))
print(getBeforeQuestRegex("https://search.yahoo.com/sometext"))
#https://search.yahoo.com/search
#https://search.yahoo.com/sometext
Bobble bubble's solution above worked very well for me. Quoting that answer:
"You can try like this by use of negated class: ]*?href="(http[^"?]+)""
The URL looks like this:
https://search.yahoo.com/search?p=Justin+Bieber&fr=fp-tts&fr2=p:fp,m:tn,ct:all......
or it could be something like this:
https://www.yahoo.com/style/5-joyful-bob-ross-tees-202237009.html
The objective was to extract the full URL if there was no literal ? in it, and otherwise to stop just before the literal ?.
Bobble bubble's answer does exactly that, very cleanly. Again, thank you to everyone for participating in this discussion; I really appreciate it.
I agree with the other answer that regex is not the right solution here, especially because there may be any number of attributes between the opening of the <a> tag and the href attribute, and there can be a newline in between too.
But, to answer the initial question:
The '*', '+', and '?' quantifiers are all greedy: they match as much text as possible.
That's why there are non-greedy versions of them:
'*?', '+?' and '??'
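To make the difference concrete, here is a small sketch. The anchor tag below is made up (the class attribute is an assumption added so that the greedy version has something to overshoot into); only the Yahoo URL comes from the question:

```python
import re

# Hypothetical HTML snippet, used only for illustration.
html = '<a href="https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&" class="x">link</a>'

# Greedy: .* runs to the LAST double quote in the string.
greedy = re.search(r'href="(.*)"', html).group(1)

# Non-greedy: .*? stops at the FIRST closing double quote.
lazy = re.search(r'href="(.*?)"', html).group(1)

# Negated class: stops at the first '"' or '?', whichever comes first.
no_query = re.search(r'href="([^"?]+)', html).group(1)

print(greedy)
print(lazy)
print(no_query)
```

The greedy capture spills into the next attribute, the non-greedy one returns the whole URL, and the negated class returns https://search.yahoo.com/search, which is what the question asks for.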

extract URL from string in python

I want to extract a full URL from a string.
My code is:
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
print(re.match(r'(ftp|http)://.*\.(jpg|png)$', data))
Output:
None
Expected Output
http://www.google.com/a.jpg
I found so many questions on StackOverflow, but none worked for me.
I have seen many posts and this is not a duplicate. Please help me! Thanks.
You were close!
Try this instead:
r'(ftp|http)://.*\.(jpg|png)'
I would also make this non-greedy like this:
r'(ftp|http)://.*?\.(jpg|png)'
By default, .* will match as much text as possible, but you want to match as little text as possible.
Your $ anchors the match at the end of the line, but the end of the URL is not the end of the line, in your example.
Another problem is that you're using re.match() and not re.search(). re.match() only matches at the beginning of the string, while re.search() searches anywhere in the string.
You should use search instead of match.
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
# Raw string so that \. reaches the regex engine unchanged
url = re.search(r'(ftp|http)://.*\.(jpg|png)', data)
if url:
    print(url.group(0))
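Putting the three points from this thread side by side (match vs. search, the $ anchor, and non-greedy matching), a short sketch on the question's own data string. The non-capturing groups (?:...) are my substitution, not part of the original pattern:

```python
import re

data = "ahahahttp://www.google.com/a.jpg>hhdhd"
pattern = r"(?:ftp|http)://.*?\.(?:jpg|png)"

# match() only tries position 0, and the string starts with "ahaha".
at_start = re.match(pattern, data)

# With $ the URL would have to sit at the very end of the string.
at_end = re.search(pattern + "$", data)

# search() without $ scans the whole string and finds the URL.
found = re.search(pattern, data)

print(at_start)
print(at_end)
print(found.group(0))
```

The first two results are None; the last call prints http://www.google.com/a.jpg.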
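Find the start of the URL with find('http://') (or find('ftp://')) and the end with find('.jpg') (or find('.png')), then take the substring:
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
start = data.find('http://')   # index where the URL begins
kk = data[start:]              # drop everything before the URL
end = kk.find('.jpg')          # index of the extension within kk
print(kk[0:end+4])             # +4 keeps the ".jpg" itself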

How to regex split, but keep the split string?

I have the following URL pattern:
http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en
I would like to get everything up until and inclusive of /watch/\d+/.
So far I have:
>>> re.split(r'watch/\d+/', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'supernatural-dub-hollywood-babylon/en']
But this does not include the split string (the string which appears between the domain and the path). The end answer I want to achieve is:
http://www.hulu.jp/watch/589851
You need to use a capture group:
>>> re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/en']
As mentioned in the other answer, you need to use groups to capture the "glue" between the split strings.
I wonder though, is what you want here a split() or a search()? It looks (from the sample) that you're trying to extract from a URL everything from the first occurrence of /watch/XXX/ where XXX is 1 or more digits, to the end of the string. If that's the case, then a match/search might be more suitable, as with a split if the search regex can match multiple times you'll split into multiple groups. Ex:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/', 'watch/2342/', 'fdsaafsdf']
Which doesn't look like what you want. Instead perhaps:
result = re.search(r'(watch/\d+/)(.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groups() if result else []
which gives:
('watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
You could also use this approach combined with named groups to get extra fancy:
result = re.search(r'(?P<watchId>watch/\d+/)(?P<path>.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groupdict() if result else {}
giving:
{'watchId': 'watch/589851/', 'path': 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf'}
If you're set on the split() approach, you can also set the maxsplit parameter to ensure it's only split once:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', maxsplit=1)
giving:
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf']
Personally though, I find that when parsing URLs into constituent parts, the search() with named groups approach works extremely well, as it allows you to name the various parts in the regex itself and, via groupdict(), get a nice dictionary you can use for working with those parts.
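Applied to the original goal (everything up to and including /watch/<digits>), the named-group idea might look like the sketch below. The group names base, watch_id and rest are made up for illustration:

```python
import re

url = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"

# .*? is non-greedy, so "base" ends at the FIRST /watch/<digits> in the URL.
m = re.search(r"(?P<base>.*?/watch/(?P<watch_id>\d+))(?P<rest>/.*)?", url)
if m:
    print(m.group("base"))
    print(m.group("watch_id"))
    print(m.group("rest"))
```

Here base is http://www.hulu.jp/watch/589851, watch_id is 589851, and rest holds the remainder of the path.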
You've surely seen the Stack Overflow don't-parse-HTML-with-regex post, yes?
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Well, regex can parse URLs, but trying to do so when there's a plethora of better tools is foolish.
This is what a regex for URLs looks like:
^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?#)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$ (+ caseless flag)
It's just a mess of characters, right? Exactly!
Don't parse URLs with regex... almost.
There is one simple thing:
A path-relative URL must be zero or more path segments separated from each other by a "/".
Splitting the URL should be as simple as url.split("/").
from urllib.parse import urlparse, urlunparse
myurl = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"
# Run a parser over it
parts = urlparse(myurl)
# Crop the path to UP TO length 2
new_path = str("/".join(parts.path.split("/")[:3]))
# Unparse
urlunparse(parts._replace(path=new_path))
#>>> 'http://www.hulu.jp/watch/589851'
You can try following regex
.*\/watch\/\d+
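A quick sketch of how that pattern behaves in Python, where / needs no escaping. Note that the leading .* is greedy, so if /watch/<digits> occurs twice (as in the longer sample URL from the other answer) it extends to the last occurrence; .*? keeps the first one:

```python
import re

url = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"
twice = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf"

single = re.match(r".*/watch/\d+", url).group(0)
greedy = re.match(r".*/watch/\d+", twice).group(0)   # runs to the last /watch/<digits>
lazy = re.match(r".*?/watch/\d+", twice).group(0)    # stops at the first /watch/<digits>

print(single)
print(greedy)
print(lazy)
```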

Python and "re"

A tutorial I have on regex in Python explains how to use the re module. I wanted to grab the URL out of an A tag, so, knowing regex, I wrote the correct expression, tested it in my regex testing app of choice, and confirmed it worked. When placed into Python, it failed:
result = re.match("a_regex_of_pure_awesomeness", "a string containing the awesomeness")
# result is None
After much head scratching I found the issue: it automatically expects your pattern to be at the start of the string. I have found a fix, but I would like to know how to change:
regex = ".*(a_regex_of_pure_awesomeness)"
into
regex = "a_regex_of_pure_awesomeness"
Okay, it's a standard URL regex but I wanted to avoid any potential confusion about what I wanted to get rid of and possibly pretend to be funny.
In Python, there's a distinction between "match" and "search"; match only looks for the pattern at the start of the string, and search looks for the pattern starting at any location within the string.
Python regex docs
Matching vs searching
from bs4 import BeautifulSoup  # BeautifulSoup 4 (the beautifulsoup4 package)
soup = BeautifulSoup(your_html, "html.parser")  # your_html: the HTML text to parse
for a in soup.find_all('a', href=True):
    # do something with `a` w/ href attribute
    print(a['href'])
>>> import re
>>> pattern = re.compile("url")
>>> string = " url"
>>> pattern.match(string)
>>> pattern.search(string)
<_sre.SRE_Match object at 0xb7f7a6e8>
Are you using the re.match() or re.search() method? My understanding is that re.match() assumes a "^" at the beginning of your expression and will only search at the beginning of the text, while re.search() acts more like the Perl regular expressions and will only match the beginning of the text if you include a "^" at the beginning of your expression. Hope that helps.
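That understanding is close. A small sketch of the difference, including the one place where match() and a leading ^ diverge (the re.MULTILINE flag); the sample strings are made up:

```python
import re

text = " url"

no_match = re.match("url", text)      # None: "url" is not at position 0
found = re.search("url", text)        # matches at span (1, 4)
anchored = re.search("^url", text)    # None: ^ pins the search to the start

# With re.MULTILINE, ^ also matches right after a newline,
# but match() still only tries the very start of the string.
lines = "x\nurl"
multi_search = re.search("^url", lines, re.MULTILINE)
multi_match = re.match("url", lines, re.MULTILINE)

print(no_match, found.span(), anchored)
print(multi_search.span(), multi_match)
```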
