I'm working on a scraper/web crawler using Python 3.5 and the re module, and one of its functions needs to retrieve a YouTube channel's URL. I'm using the following regular expression to accomplish this:
href = re.compile("(/user/|/channel/)(.+)")
What it should return is something like /user/username or /channel/channelname. It does this successfully for the most part, but every now and then it grabs a type of URL that includes more information like /user/username/videos?view=60 or something else that goes on after the username/ portion.
In an attempt to address this issue, I rewrote the bit of code above as
href = re.compile("(/user/|/channel/)(?!(videos?view=60)(.+)")
along with other variations, with no success. How can I rewrite my code so that it fetches URLs that do not include videos?view=60 anywhere in the URL?
Use the following approach with a specific regex pattern:
import re
user_url = '/user/username/videos?view=60'
channel_url = '/channel/channelname/videos?view=60'
pattern = re.compile(r'(/user/|/channel/)([^/]+)')
m = re.match(pattern, user_url)
print(m.group()) # /user/username
m = re.match(pattern, channel_url)
print(m.group()) # /channel/channelname
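The same idea can be applied to full hrefs. This is only a minimal sketch: re.search is used so the prefix may appear anywhere in the string, and the character class [^/?#]+ stops the name at the next slash, query string, or fragment. The function name channel_path and the sample hrefs are made up for illustration.

```python
import re

# Stop the name at the next "/", "?" or "#", so anything after it is dropped.
CHANNEL_RE = re.compile(r'/(?:user|channel)/[^/?#]+')

def channel_path(href):
    """Return '/user/<name>' or '/channel/<name>', or None if neither is present."""
    match = CHANNEL_RE.search(href)
    return match.group() if match else None

print(channel_path('https://www.youtube.com/user/username/videos?view=60'))
print(channel_path('/channel/channelname?view=60'))
print(channel_path('/watch?v=abc'))
```

Returning None for hrefs that are not channel links lets the crawler skip them without a try/except.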
I used this approach, and it seems to do what you want.
import re
user = '/user/username/videos?view=60'
channel = '/channel/channelname/videos?view=60'
pattern = re.compile(r"(/user/|/channel/)[\w]+/")
user_match = re.search(pattern, user)
if user_match:
    print(user_match.group())  # /user/username/ (the trailing slash is part of the match)
else:
    print("Invalid pattern")
channel_match = re.search(pattern, channel)
if channel_match:
    print(channel_match.group())  # /channel/channelname/
else:
    print("Invalid pattern")
Hope this helps!
So I have the following text example:
Good Morning,
The link to your exam is https://uni.edu?hash=89234rw89yfw8fw89ef .Please complete it within the stipulated time.
If you have any issue, please contact us
https://www.uni.edu
https://facebook.com/uniedu
What I want is to extract the URL of the exam link: https://uni.edu?hash=89234rw89yfw8fw89ef . I'm planning to use re.findall(), but I'm having difficulty writing the regex to extract that specific URL.
import re
def find_exam_url(text_file):
    filename = open(text_file, "r")
    new_file = filename.readlines()
    word_lst = []
    for line in new_file:
        exam_url = re.findall('https?://', line) #use regex to extract exam url
    return exam_url
if __name__ == "__main__":
    print(find_exam_url("mytextfile.txt"))
The output I get is:
['http://']
Instead of:
https://uni.edu?hash=89234rw89yfw8fw89ef
Would appreciate some help on this.
This regex works:
>>> re.findall(r'(https?://.*?)\s', s)
['https://uni.edu?hash=89234rw89yfw8fw89ef',
'https://www.uni.edu',
'https://facebook.com/uniedu']
where s represents the text in your file (read by f.read()) and the pattern used is (https?://.*?)\s (lazy match until whitespace occurs).
If you need to extract the url mentioned as an exam link, you can make the regex more specific:
>>> re.findall(r'exam.*(https?://.*?)\s', s)
['https://uni.edu?hash=89234rw89yfw8fw89ef']
Alternatively, the exam link seems to carry an identifier as a URL parameter of the form ?hash=, so something like this is better:
>>> re.findall(r'(https?://.*\?hash=.*?)\s', s)
['https://uni.edu?hash=89234rw89yfw8fw89ef']
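A variation on the patterns above, offered as a sketch: \S+ (one or more non-whitespace characters) does not need a whitespace character after the URL, so a link at the very end of the text is still found. The function name find_exam_url and the assumption that the exam link always carries a ?hash= parameter are illustrative, not something the question confirms.

```python
import re

def find_exam_url(text):
    """Return the first URL containing '?hash=', or None if there is none."""
    match = re.search(r'https?://\S*\?hash=\S+', text)
    return match.group() if match else None

sample = (
    "Good Morning,\n"
    "The link to your exam is https://uni.edu?hash=89234rw89yfw8fw89ef "
    ".Please complete it within the stipulated time.\n"
    "If you have any issue, please contact us\n"
    "https://www.uni.edu\n"
    "https://facebook.com/uniedu"
)

print(find_exam_url(sample))
# All URLs, including the last one, which has no whitespace after it.
all_urls = re.findall(r'https?://\S+', sample)
print(all_urls)
```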
I am trying to find a partial match for a string in Python with re.search(), without luck.
This is my code:
import re
request_path = '/colpos/papanicolaou2/124579/1254'
urls = ['/colpos/prescription', '/colpos/transfer', '/colpos/papanicolaou2', '/colpos/biopsia']
for url in urls:
    c_url = re.compile(url)
    result = re.search(c_url, request_path)
    if isinstance(result, re.Match):
        allowed_url = url
        break
print(allowed_url) # must be /colpos/papanicolaou2
What do I want to happen? If url is contained in request_path (in this case partially), I expect result to be a re.Match instance, not None.
How can I achieve this? Is there a better way to know whether my request_path is in urls?
The code above only works if url and request_path are exactly the same, which is not what I want. How should I use re.search() in Python to achieve this?
Thank you.
I tried checking it with the "in" operator instead of using the re module. I think it is simpler and more readable.
request_path = '/colpos/papanicolaou2/124579/1254'
urls = ['/colpos/prescription', '/colpos/transfer', '/colpos/papanicolaou2', '/colpos/biopsia']
allowed_urls = []
for url in urls:
    if url in request_path:
        allowed_urls.append(url)
print(allowed_urls) # this contains '/colpos/papanicolaou2' like you wanted
In case your request_path always has 2 fixed (real) parts, you could do the following (no loops, no regex - just Python):
/colpos/papanicolaou2/124579/1254
/part_1/part_2 /param1/param2/...
Code:
urls = ['colpos/prescription', 'colpos/transfer', 'colpos/papanicolaou2', 'colpos/biopsia']
request_path = "/colpos/papanicolaou2/124579/1254"
p1, p2, params = request_path[1:].split('/', 2)
if '/'.join([p1, p2]).lower() not in urls:
    #raise Error(404)
    print("url not found")
Note: You would need to make it more stable for production usage :)
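One possible way to make the check a bit stricter, as a sketch only: compare whole path segments, so that an allowed prefix such as /colpos/papanicolaou2 does not also match /colpos/papanicolaou22. The function name find_allowed_prefix is hypothetical, and the sketch assumes the entries in urls start with a slash, as in the question.

```python
def find_allowed_prefix(request_path, urls):
    """Return the first entry of urls that is a whole-segment prefix of request_path."""
    for url in urls:
        # Either an exact match, or the prefix is followed by a "/" boundary.
        if request_path == url or request_path.startswith(url + '/'):
            return url
    return None

urls = ['/colpos/prescription', '/colpos/transfer', '/colpos/papanicolaou2', '/colpos/biopsia']
print(find_allowed_prefix('/colpos/papanicolaou2/124579/1254', urls))
print(find_allowed_prefix('/colpos/papanicolaou22/1', urls))
```

Unlike the tuple unpacking above, this does not raise when the path has fewer than three parts.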
I am trying to normalize a URL, to extract the content after :// and before the last / at the end of line if it exists.
Here is my script:
import re
url = "https://example.com/25194425/"
matchUrl = re.findall(r'://(.*)/?$', url)
print(matchUrl)
What I want is example.com/25194425, but I get example.com/25194425/. How to deal with the last /?
Why doesn't /? work?
An alternative way to do it without using regex is using urlparse
>>> from urlparse import urlparse
>>> url = 'https://example.com/25194425/'
>>> '{url.netloc}{url.path}'.format(url=urlparse(url)).rstrip('/')
'example.com/25194425'
Later on, if you want to include the protocol, port, params, ... in the normalized URL, that is easier to do this way than by updating the regex:
>>> '{url.scheme}://{url.netloc}{url.path}'.format(url=urlparse(url)).rstrip('/')
'https://example.com/25194425'
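The session above uses the Python 2 module name. In Python 3 the same function lives in urllib.parse; a minimal sketch of the equivalent (the helper name normalize is made up for illustration):

```python
from urllib.parse import urlparse

def normalize(url):
    """Return host plus path, without the scheme and without a trailing slash."""
    parts = urlparse(url)
    return (parts.netloc + parts.path).rstrip('/')

print(normalize('https://example.com/25194425/'))
print(normalize('https://example.com/25194425'))
```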
As one of the commenters said, you just need to make the quantifier non-greedy:
://(.*?)/?$
However, the result of findall() is a list, not a string. In this case it's a list with only one entry, but it's still a list. To get the actual string, you need to provide the index:
import re
url = "https://example.com/25194425/"
match = re.findall(r'://(.*?)/?$', url)
print(match[0])
But that seems like an inappropriate use of findall() to me. I would have gone with search():
url = "https://example.com/25194425/"
match = re.search(r'://(.*?)/?$', url)
if match:
    print(match.group(1))
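To see why the original /? appeared to do nothing, it may help to run the greedy and the non-greedy pattern side by side. A small sketch: with the greedy (.*), the group swallows the final slash and the optional /? is left to match the empty string.

```python
import re

url = "https://example.com/25194425/"

# Greedy: (.*) takes everything to the end, so "/?" matches nothing.
greedy = re.search(r'://(.*)/?$', url).group(1)
# Non-greedy: (.*?) takes as little as possible, leaving the slash for "/?".
lazy = re.search(r'://(.*?)/?$', url).group(1)

print(greedy)
print(lazy)
```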
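This happens because quantifiers are greedy by default: they match as many characters as possible. So '(.*)' consumes everything up to and including the last slash, and the optional '/?' is left with nothing to match.
You can use this instead: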
matchUrl = re.findall(r'://(.*)/[^/]?$', url)
EDIT: Please try the following pattern (Python 2.7.x):
import re
url1 = 'https://example.com/25194425?/'
url2 = 'https://example.com/25194425?'
print(re.findall(r'https?://([\S]+)(?<!/)[/]?', url1))
print(re.findall(r'https?://([\S]+)(?<!/)[/]?', url2))
Output:
['example.com/25194425?']
['example.com/25194425?']
Thanks @Alan Moore for pointing out the word boundary issue. Now it should work for both scenarios.
I started learning Python earlier today and as my first project I wanted to make a script that shows me today's weather forecast.
My script:
import urllib2, re
url = urllib2.urlopen('http://www.wetter.com/wetter_aktuell/wettervorhersage/heute /deutschland/oberhausen/DE0007740.html')
html = url.read()
url.close()
x = re.search("""<dl><dd><strong>(?P<uhrzeit>.*)""", html, re.S)
x = re.search("""<dd><span class="degreespan" style="font-weight:normal;">(?P<temp>.*)""", html, re.S)
print x.group('uhrzeit'), x.group('temp')
I used this as a template. When I run this script I get "IndexError: no such group".
You are overwriting x.
Maybe you want:
x = re.search("""<dl><dd><strong>(?P<uhrzeit>.*)""", html, re.S)
y = re.search("""<dd><span class="degreespan" style="font-weight:normal;">(?P<temp>.*)""", html, re.S)
print x.group('uhrzeit'), y.group('temp')
And I can't believe that the site you linked advocates using regular expressions for extracting information from HTML.
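A sketch of the same fix with two further precautions: each match gets its own name, and both are checked for None before calling group(). The HTML snippet here is invented so the example runs without a network request; the real page's markup will differ. The patterns are also made non-greedy and closed with an end tag, since a bare (.*) with re.S would capture everything to the end of the document.

```python
import re

# Invented markup for illustration only; not the actual page source.
html = (
    '<dl><dd><strong>14:00</strong></dd>'
    '<dd><span class="degreespan" style="font-weight:normal;">21</span></dd></dl>'
)

time_match = re.search(r'<dl><dd><strong>(?P<uhrzeit>.*?)</strong>', html, re.S)
temp_match = re.search(
    r'<span class="degreespan" style="font-weight:normal;">(?P<temp>.*?)</span>',
    html, re.S)

# re.search returns None when nothing matches, so check before using group().
if time_match and temp_match:
    print(time_match.group('uhrzeit'), temp_match.group('temp'))
else:
    print('pattern not found')
```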
I need help. How do I get the domain from a string?
For example: "Hi im Natsume, check out my site http://www.mysite.com/"
How do I get just mysite.com?
Output example:
http://www.mysite.com/ (if http entered)
www.mysite.com (if http not entered)
mysite.com (if both http and www not entered)
>>> myString = "Hi im Natsume, check out my site http://www.mysite.com/"
>>> a = re.search(r"(?P<url>https?://[^\s]+)", myString) or re.search(r"(?P<url>www[^\s]+)", myString)
>>> a.group("url")
'http://www.mysite.com/'
>>> myString = "Hi im Natsume, check out my site www.mysite.com/"
>>> a = re.search(r"(?P<url>https?://[^\s]+)", myString) or re.search(r"(?P<url>www[^\s]+)", myString)
>>> a.group("url")
'www.mysite.com/'
Well ... You need some way to define what you consider to be something that has a "domain". One approach might be to look up a regular expression for URL-matching, and apply that to the string. If that succeeds, you at least know that the string holds a URL, and can continue to interpret the URL in order to look for a host name, from which you can then extract the domain (possibly).
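The approach described above could look roughly like this: find something URL-shaped with a regex, hand it to a URL parser, and read the host name from the result. This is a sketch under loud assumptions: the function name extract_host is made up, the regex only recognises text starting with http://, https:// or www., and stripping a leading "www." is a heuristic that does not give the true registered domain for hosts such as www.example.co.uk.

```python
import re
from urllib.parse import urlparse

def extract_host(text):
    """Return the host of the first URL-like token in text, minus a leading 'www.'."""
    match = re.search(r'https?://\S+|www\.\S+', text)
    if match is None:
        return None
    url = match.group()
    # urlparse only fills in the host when a scheme is present.
    if not url.startswith(('http://', 'https://')):
        url = 'http://' + url
    host = urlparse(url).hostname
    if host and host.startswith('www.'):
        host = host[4:]  # heuristic, see the assumptions above
    return host

print(extract_host("Hi im Natsume, check out my site http://www.mysite.com/"))
print(extract_host("Hi im Natsume, check out my site www.mysite.com/"))
```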
s= "Hi im Natsume, check out my site http://www.mysite.com/"
start=s.find("http://") if s.find("http://")!=-1 else s.find("https://")+1
t = s[start+11:s.find(" ",start+11)]
print(t)
output:
mysite.com
If you want to use a regular expression, one way could be:
>>> s = "Hi im Natsume, check out my site http://www.mysite.com/"
>>> re.findall(r'http://www\.([a-zA-Z0-9._-]*)/', s)
['mysite.com']
This assumes the URL ends with '/'.
If all the sites had the same format, you could use a regexp like this (which works in this specific case):
re.findall(r'http://www\.(\w+)\.com', url)
However, you would need a more complex regexp to parse arbitrary URLs and extract the domain name.
The best way is to use a regex to extract the URL, then use tldextract to get a valid domain name from it.
import re
import tldextract
text = "Hi im Natsume, check out my site http://www.example.com/"
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
found_url = urls[0]
info = tldextract.extract(found_url)
domain_name = info.domain
suffix_name = info.suffix
final_domain_name = domain_name+"."+suffix_name
print(final_domain_name)
How about this?
url='https://www.google.com/'
var=url.split('//www.')[1]
domain=var[0:var.index('/')]
print(domain)