I am using Python and django, and I have some vimeo URLs I need to convert to their embed versions. For example, this:
https://vimeo.com/76979871
has to be converted into this:
https://player.vimeo.com/video/76979871
but only if it has not already been converted.
My code is below:
_vm = re.compile(
    r'/(?:https?:\/\/(?:www\.)?)?vimeo.com\/(?:channels\/|groups\/([^\/]*)\/videos\/|album\/(\d+)\/video\/|)(\d+)(?:$|\/|\?)/', re.I)
_vm_format = 'https://player.vimeo.com/video/{0}'

def replace(match):
    groups = match.groups()
    print(_vm_format)
    return _vm_format.format(groups[5])

return _vm.sub(replace, text)
The given regular expression fits several variants of Vimeo URL:
https://vimeo.com/76979871
https://vimeo.com/channels/76979871
https://vimeo.com/groups/sdf/videos/76979871
https://vimeo.com/album/12321/video/76979871
The video number, provided it is really the only thing that you need for your player, will be in capture group 1 (groups[1]) after you slightly correct the regular expression: r'(?:https?:\/\/(?:www\.)?)?vimeo.com\/(?:channels\/|groups\/(?:[^\/]*)\/videos\/|album\/(?:\d+)\/video\/|)(\d+)(?:$|\/|\?)'. All other parentheses are non-capturing groups.
If, however, the player code is different for different URL types, then you better split your regular expression in four; and there will be different replacements for each.
You have to remove the / delimiters from both ends of the pattern and use capture group 3 to get the video id:
(?:https?:\/\/(?:www\.)?)?vimeo.com\/(?:channels\/|groups\/([^\/]*)\/videos\/|album\/(\d+)\/video\/|)(\d+)(?:$|\/|\?)
Example
import re

_vm = re.compile(
    r'(?:https?:\/\/(?:www\.)?)?vimeo.com\/(?:channels\/|groups\/([^\/]*)\/videos\/|album\/(\d+)\/video\/|)(\d+)(?:$|\/|\?)', re.I)
_vm_format = 'https://player.vimeo.com/video/{0}'

def replace(match):
    groups = match.groups()
    return _vm_format.format(groups[2])

urls = ["https://vimeo.com/76979871",
        "https://vimeo.com/channels/76979871",
        "https://vimeo.com/groups/sdf/videos/76979871",
        "https://vimeo.com/album/12321/video/76979871"]
for u in urls:
    print(_vm.sub(replace, u))
Output
https://player.vimeo.com/video/76979871
https://player.vimeo.com/video/76979871
https://player.vimeo.com/video/76979871
https://player.vimeo.com/video/76979871
Related
While parsing file names of TV shows, I would like to extract information about them to use for renaming. I have a working model, but it currently uses 28 if/elif statements for every iteration of filename I've seen over the last few years. I'd love to be able to condense this to something that I'm not ashamed of, so any help would be appreciated.
Phase one of this code repentance is to hopefully grab multiple episode numbers. I've gotten as far as the code below, but in the first entry it only displays the first episode number and not all three.
import re

def main():
    pattern = '(.*)\.S(\d+)[E(\d+)]+'
    strings = ['blah.s01e01e02e03', 'foo.s09e09', 'bar.s05e05']
    #print(strings)
    for string in strings:
        print(string)
        result = re.search("(.*)\.S(\d+)[E(\d+)]+", string, re.IGNORECASE)
        print(result.group(2))

if __name__ == "__main__":
    main()
This outputs:
blah.s01e01e02e03
01
foo.s09e09
09
bar.s05e05
05
It's probably trivial, but regular expressions might as well be Cuneiform most days. Thanks in advance!
No. You can use findall to find all e\d+, but it cannot find overlapping matches, which makes it impossible to use s\d+ together with it (i.e. you can't distinguish e02 in "foo.s01e006e007" from that of "age007.s01e001"), and Python doesn't let you use variable-length lookbehind (to make sure s\d+ is before it without overlapping).
The way to do this is to find \.s\d+((?:e\d+)+)$ then split the resultant group 1 in another step (whether by using findall with e\d+, or by splitting with (?<!^)(?=e)).
import re

text = 'blah.s01e01e02e03'
match = re.search(r'\.(s\d+)((?:e\d+)+)$', text, re.I)
season = match.group(1)
episodes = re.findall(r'e\d+', match.group(2), re.I)
print(season, episodes)
# => s01 ['e01', 'e02', 'e03']
Using re.findall instead of re.search returns a list of all matches in one call.
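As a one-line illustration of that difference:

```python
import re

# findall returns every non-overlapping match in one call
print(re.findall(r'e\d+', 'blah.s01e01e02e03', re.I))
# => ['e01', 'e02', 'e03']
```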
If you can make use of the PyPi regex module you could make use of repeating capture groups in the pattern, and then use .captures()
For example:
import regex

s = "blah.s01e01e02e03"
pattern = r"\.(s\d+)(e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
    print(m.captures(1)[0], m.captures(2))
Output:
s01 ['e01', 'e02', 'e03']
See a Python demo and a regex101 demo.
Or use .capturesdict() with named capture groups.
For example:
import regex

s = "blah.s01e01e02e03"
pattern = r"\.(?P<season>s\d+)(?P<episodes>e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
    print(m.capturesdict())
Output:
{'season': ['s01'], 'episodes': ['e01', 'e02', 'e03']}
See a Python demo.
Note that the notation [E(\d+)] that you used is a character class: it matches a single character from the listed set, i.e. E, (, a digit, + or ).
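You can see that character-class behaviour directly (a minimal check):

```python
import re

# [E(\d+)]+ matches a run of the characters E, (, any digit, + or )
# (case-insensitively here), not the repeated group the pattern intended.
print(re.findall(r'[E(\d+)]+', 'blah.s01e01e02e03', re.I))
# => ['01e01e02e03']
```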
I'm trying to find all cases of money values in a string called webpage.
String webpage is the text from this webpage, in my program it's just hardcoded because that's all that is needed, but I won't paste it all here.
regex = r'^[$£€]?(([\d]{1,3},([\d]{3},)*[\d]{3}|[0-9]+)(\.[0-9][0-9])?(\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)
It returns [], but I expected it to return [$131bn, £100bn, $100bn, $17.4bn].
Without knowing the text it has to search, you could use the regex:
([€|$|£]+[0-9a-zA-Z\,\.]+)
to capture everything that contains €, £ or $, and then print the amount without following words or letters. See the example in action here: http://rubular.com/r/a7O7AGF9Zl.
Using this regex we get this code:
import re
webpage = '''
one
million
dollars
test123
$1bn asd
€5euro
$1923,1204bn
€1293.1205 million'''
regex = r'([€|$]+[0-9a-zA-Z\,\.]+)'
res = re.findall(regex, webpage)
print(res)
with the output:
['$1bn', '€5euro', '$1923,1204bn', '€1293.1205']
EDIT: Using the same regex on the provided website, it returns the output of:
['$131bn', '$100bn', '$17.4bn.', '$52.4bn']
If you modify the regex further to find e.g. 500million, you can add 0-9 to your first bracket, as you then search for either £, €, $ or anything that starts with 0-9.
Output of:
webpage = '''
one
million
€1293.1205 million
500million
'''
regex = r'([€|$0-9]+[0-9a-zA-Z\,\.]+)'
Therefore becomes:
['€1293.1205', '500million']
The first error in your regex is the ^ at the beginning, which anchors the match to the start of the string; that isn't helpful when using findall.
Also, you are defining a lot of groups (()) that I assume you don't really need, so make them all non-capturing (by adding ?: after the opening parenthesis) and you will get very close to what you want:
regex = r'[$£€](?:(?:[\d]{1,3},(?:[\d]{3},)*[\d]{3}|[0-9]+)(?:\.[0-9][0-9])?(?:\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)
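Run against a short stand-in string (the real page text isn't shown in the question), that group-free pattern behaves like this:

```python
import re

# Short stand-in text; the question uses the scraped page body instead.
webpage = "Totals were $131bn this year, £100bn pledged and $100bn spent."
regex = r'[$£€](?:(?:[\d]{1,3},(?:[\d]{3},)*[\d]{3}|[0-9]+)(?:\.[0-9][0-9])?(?:\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
print(re.findall(regex, webpage))
# => ['$131bn', '£100bn', '$100bn']
```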
A webscraping solution:
import itertools
import re
import urllib.request

from bs4 import BeautifulSoup as soup

s = soup(urllib.request.urlopen('http://www.bbc.com/news/business-41779341').read(), 'lxml')
final_data = list(itertools.chain.from_iterable(
    filter(None, [re.findall('[€\$£][\w\.]+', i.text) for i in s.findAll('p')])))
Output:
['$131bn', '£100bn', '$100bn', '$17.4bn.']
First, I want to grab this kind of string from a text file
{kevin.knerr, sam.mcgettrick, mike.grahs}@google.com.au
And then convert it to separate strings such as
kevin.knerr@google.com.au
sam.mcgettrick@google.com.au
mike.grahs@google.com.au
For example text file can be as:
Some gibberish words
{kevin.knerr, sam.mcgettrick, mike.grahs}@google.com.au
Some Gibberish words
As said in the comments, better grab the part in {} and use some programming logic afterwards. You can grab the different parts with:
\{(?P<individual>[^{}]+)\}@(?P<domain>\S+)
# looks for {
# captures everything that is not } into the group "individual"
# looks for @ afterwards
# saves everything that is not whitespace into the group "domain"
See a demo on regex101.com.
In Python this would be:
import re

rx = r'\{(?P<individual>[^{}]+)\}@(?P<domain>\S+)'
string = 'gibberish {kevin.knerr, sam.mcgettrick, mike.grahs}@google.com.au gibberish'

for match in re.finditer(rx, string):
    print(match.group('individual'))
    print(match.group('domain'))
Python Code
import re

ip = "{kevin.knerr, sam.mcgettrick, mike.grahs}@google.com.au"
arr = re.match(r"\{([^\}]+)\}(@\S+$)", ip)

# Using split for solution
for x in arr.group(1).split(","):
    print(x.strip() + arr.group(2))

# Regex based solution
arr1 = re.findall(r"([^, ]+)", arr.group(1))
for x in arr1:
    print(x + arr.group(2))
links = re.findall('href="(http(s?)://[^"]+)"',page)
I have this regular expression to find all links in a website, I am getting this result:
('http://asecuritysite.com', '')
('https://www.sans.org/webcasts/archive/2013', 's')
When what I want is only this:
http://asecuritysite.com
https://www.sans.org/webcasts/archive/2013
If I eliminate the "( after the href it gives me loads of errors, can someone explain why?
If you use more than one capturing group, re.findall returns a list of tuples instead of a list of strings. Try the following (using only a single group):
>>> import re
>>> page = '''
... <a href="http://asecuritysite.com">here</a>
... <a href="https://www.sans.org/webcasts/archive/2013">there</a>
... '''
>>> re.findall(r'href="(https?:\/\/[^"]+)"', page)
['http://asecuritysite.com', 'https://www.sans.org/webcasts/archive/2013']
According to re.findall documentation:
If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group.
Try getting rid of the second group (the (s?) in your original pattern):
links = re.findall('href="(https?:\/\/[^"]+)"',page)
What you are doing wrong is trying to parse HTML with Regex. And that sir, is a sin.
See here for the horrors of Regex parsing HTML
An alternative is to use something like lxml to parse the page and extract the links something like this
urls = html.xpath('//a/@href')
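If lxml isn't available, the stdlib html.parser can do the same job; a minimal sketch:

```python
from html.parser import HTMLParser

# Collect href attributes from <a> tags using the standard library parser.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<a href="http://asecuritysite.com">here</a> '
               '<a href="https://www.sans.org/webcasts/archive/2013">there</a>')
print(collector.links)
# => ['http://asecuritysite.com', 'https://www.sans.org/webcasts/archive/2013']
```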
You're going to run into problems too if it's a single quote before the https? instead of double.
(https?:\/\/[^\"\'\>]+) will capture the entire string; what you could then do is prepend (href=.?) to it, and you'd end up with two capture groups:
Full regex: (href=.?)(https?:\/\/[^\"\'\>]+)
MATCH 1
[Group 1] href='
[Group 2] http://asecuritysite.com
MATCH 2
[Group 1] href='
[Group 2] https://www.sans.org/webcasts/archive/2013
Here is a working example: http://regex101.com/r/gO8vV7
Suppose I got these urls.
http://abdd.eesfea.domainname.com/b/33tA$/0021/file
http://mail.domainname.org/abc/abc/aaa
http://domainname.edu
I just want to extract "domainname.com" or "domainname.org" or "domainname.edu" out.
How can I do this?
I think I need to find the last dot just before "com|org|edu..." and print the content from that dot's previous dot to its next dot (if it has one).
I need help with the regular expression.
Thanks a lot!
I am using Python.
why use regex?
http://docs.python.org/library/urlparse.html
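For instance (note that Python 3 moved it to urllib.parse):

```python
from urllib.parse import urlparse  # 'import urlparse' on Python 2

# netloc is the host part; its last two dot-separated labels are the domain
netloc = urlparse('http://mail.domainname.org/abc/abc/aaa').netloc
print(netloc)                             # mail.domainname.org
print('.'.join(netloc.split('.')[-2:]))   # domainname.org
```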
If you would like to go the regex route...
RFC-3986 is the authority regarding URIs. Appendix B provides this regex to break one down into its components:
re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
# Where:
# scheme = $2
# authority = $4
# path = $5
# query = $7
# fragment = $9
Here is an enhanced, Python friendly version which utilizes named capture groups. It is presented in a function within a working script:
import re

def get_domain(url):
    """Return top two domain levels from URI"""
    re_3986_enhanced = re.compile(r"""
        # Parse and capture RFC-3986 Generic URI components.
        ^                                   # anchor to beginning of string
        (?: (?P<scheme> [^:/?#\s]+): )?     # capture optional scheme
        (?://(?P<authority> [^/?#\s]*) )?   # capture optional authority
        (?P<path> [^?#\s]*)                 # capture required path
        (?:\?(?P<query> [^#\s]*) )?         # capture optional query
        (?:\#(?P<fragment> [^\s]*) )?       # capture optional fragment
        $                                   # anchor to end of string
        """, re.MULTILINE | re.VERBOSE)
    re_domain = re.compile(r"""
        # Pick out top two levels of DNS domain from authority.
        (?P<domain>[^.]+\.[A-Za-z]{2,6})    # $domain: top two domain levels.
        (?::[0-9]*)?                        # Optional port number.
        $                                   # Anchor to end of string.
        """, re.MULTILINE | re.VERBOSE)
    result = ""
    m_uri = re_3986_enhanced.match(url)
    if m_uri and m_uri.group("authority"):
        auth = m_uri.group("authority")
        m_domain = re_domain.search(auth)
        if m_domain and m_domain.group("domain"):
            result = m_domain.group("domain")
    return result

data_list = [
    r"http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    r"http://mail.domainname.org/abc/abc/aaa",
    r"http://domainname.edu",
    r"http://domainname.com:80",
    r"http://domainname.com?query=one",
    r"http://domainname.com#fragment",
]

cnt = 0
for data in data_list:
    cnt += 1
    print("Data[%d] domain = \"%s\"" % (cnt, get_domain(data)))
For more information regarding the picking apart and validation of a URI according to RFC-3986, you may want to take a look at an article I've been working on: Regular Expression URI Validation
In addition to Jase's answer: if you don't want to use urlparse, just split the URLs.
Strip off the protocol (http:// or https://).
Then split the string at the first occurrence of '/'. This will leave you with something like 'mail.domainname.org' for the second URL. That can then be split by '.', and you select the last two items of the list.
This will always yield domainname.org (or whatever the domain is), provided you strip the protocol correctly and the URLs are valid.
I would just use urlparse, but it can be done.
Dunno about the regex, but this is how I would do it.
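A rough sketch of those steps (assuming a well-formed URL, and ignoring ports and query strings):

```python
# Split-based approach described above: strip protocol, keep host,
# join the last two dot-separated labels.
def domain_from_url(url):
    # strip the protocol
    for prefix in ('http://', 'https://'):
        if url.startswith(prefix):
            url = url[len(prefix):]
    # keep only the host part (everything before the first '/')
    host = url.split('/', 1)[0]
    # the last two dot-separated labels form the domain
    return '.'.join(host.split('.')[-2:])

print(domain_from_url('http://mail.domainname.org/abc/abc/aaa'))
# => domainname.org
```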
Should you need more flexibility than urlparse provides, here's an example to get you started:
import re

def getDomain(url):
    # requires 'http://' or 'https://'
    #pat = r'(https?):\/\/(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    # 'http://' or 'https://' is optional
    pat = r'((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    m = re.match(pat, url)
    if m:
        domain = m.group('domain')
        return domain
    else:
        return False
I used the named group (?P<domain>\w+) to grab the match, which is then indexed by its name, m.group('domain').
The great thing about learning regular expressions is that once you are comfortable with them, solving even the most complicated parsing problems is relatively simple. This pattern could be improved to be more or less forgiving if necessary -- this one for example will return '678' if you pass it 'http://123.45.678.90', but should work great on just about any other URL you can come up with. Regexr is a great resource for learning and testing regexes.