finding the regex to get a url between two phrases

I have the following script that tries to extract this URL: https://clips-media-assets.twitch.tv/178569498.mp4, which sits between {"quality":"1080","source":" and the next ", but my regex doesn't seem to work:
import re
from bs4 import BeautifulSoup

dt = """
<body>
<script>jQuery(window).load(function () {
setTimeout(function(){s
}, 1000);quality_options: [{"quality":"1080","source":"https://clips-media-assets.twitch.tv/178569498.mp4","frame_rate":60},{"quality":"720","source":"https://clips-media-assets.twitch.tv/AT-178569498-1280x720.mp4","frame_rate":60},{"quality":"480","source":"https://clips-media-assets.twitch.tv/AT-178569498-854x480.mp4","frame_rate":30},{"quality":"360","source":"https://clips-media-assets.twitch.tv/AT-178569498-640x360.mp4","frame_rate":30}]
});</script>
</body>
[download] 28.2x of 57.90MiB at 1.54MiB/s ETA 00:26
"""
pattern = re.compile(r'(?:\G(?!\A)|quality\":\"1080\",\"source\":\")(?:(?!\").)*', re.MULTILINE | re.DOTALL)
clipHTML = BeautifulSoup(dt, "html.parser")
scripts = clipHTML.findAll(['script'])
for script in scripts:
    if script:
        match = pattern.search(script.text)
        if match:
            email = match.group(0)
            print(email)

If you insist on using a regex for this, try the following:
(?<=quality\":\"1080\",\"source\":\")[^\"]+(?=\")
I can't speak to this specific case, but in general it is not ideal to parse JSON with regular expressions. You can of course allow a variable number of spaces in the regex with ( *), but I still think a JSON parser is the better choice.
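
As a rough sketch of both options, the snippet below runs the lookaround regex and a JSON-based alternative against a trimmed copy of the script text from the question. The names script_text, url_pattern, and options are made up for illustration, and the quality_options regex assumes the array contains no nested brackets. Note also that Python's built-in re module does not support the \G anchor used in the question's pattern, which may be why it fails to compile.

```python
import json
import re

# Trimmed copy of the script text from the question (two entries only).
script_text = (
    'quality_options: [{"quality":"1080","source":'
    '"https://clips-media-assets.twitch.tv/178569498.mp4","frame_rate":60},'
    '{"quality":"720","source":'
    '"https://clips-media-assets.twitch.tv/AT-178569498-1280x720.mp4","frame_rate":60}]'
)

# Option 1: the lookbehind/lookahead regex from the answer.
url_pattern = re.compile(r'(?<="quality":"1080","source":")[^"]+(?=")')
match = url_pattern.search(script_text)
regex_url = match.group(0) if match else None

# Option 2: isolate the JSON array (assumes no nested brackets),
# then let the json module do the parsing.
array_match = re.search(r'quality_options:\s*(\[.*?\])', script_text, re.DOTALL)
options = json.loads(array_match.group(1)) if array_match else []
json_url = next((o["source"] for o in options if o["quality"] == "1080"), None)

print(regex_url)
print(json_url)
```

The JSON route keeps working if the site reorders keys or adds spaces, which the fixed-width lookbehind would not tolerate.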

Related

Unable to get regex in python to match pattern

I'm trying to pull a number out of a copy of an HTML page that I fetched with urllib.request.
I've tried a few different regex patterns but keep getting None as the output, so I'm clearly not writing the pattern correctly and can't get it to work.
Below is a small part of the HTML I have in the string:
</ul>\n \n <p>* * * * *</p>\n -->\n \n <b>DistroWatch database summary</b><br/>\n <ul>\n <li>Number of all distributions in the database: 926<br/>\n <li>Number of <a href="search.php?status=Active">
I'm trying to get just the 926 out of the string. My code is below, and I can't figure out what I'm doing wrong:
import urllib.request
import re
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
#print(page.read())
print(page.read())
pageString = str(page.read())
#print(pageString)
DistroCount = re.search('^all distributions</a> in the database: ....<br/>\n$', pageString)
print(DistroCount)
Any help, pointers, or resource suggestions would be much appreciated.

You can use BeautifulSoup to convert HTML to text, and then apply a simple regex to extract a number after a hardcoded string:
import urllib.request, re
from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
html = page.read()
soup = BeautifulSoup(html, 'lxml')
text = soup.get_text()
m = re.search(r'all distributions in the database:\s*(\d+)', text)
if m:
    print(m.group(1))
    # => 926
Here,
soup.get_text() converts HTML to plain text and keeps it in the text variable
The all distributions in the database:\s*(\d+) regex matches all distributions in the database:, then zero or more whitespace chars and then captures into Group 1 any one or more digits (with (\d+))

I think your problem is that you read the whole document into a single string but use "^" at the beginning of your regex and "$" at the end, so the regex can only match the entire string.
Either drop ^ and $ (and \n as well), or process your document line by line.
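
To illustrate the effect of the anchors, here is a small offline sketch. The page_string value is a hand-made fragment modeled on the HTML in the question, not the live page, and the variable names are only illustrative.

```python
import re

# Hand-made fragment modeled on the HTML shown in the question.
page_string = (
    "<b>DistroWatch database summary</b><br/>\n"
    "<ul>\n"
    "<li>Number of all distributions in the database: 926<br/>\n"
)

# Anchored: ^ and $ pin the match to the start and end of the whole string,
# so text in the middle of the document can never match.
anchored = re.search(r'^all distributions in the database: \d+<br/>\n$', page_string)

# Unanchored, with a capture group around the digits.
unanchored = re.search(r'all distributions in the database:\s*(\d+)', page_string)

print(anchored)             # None
print(unanchored.group(1))  # 926
```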

Extracting JSON from script tag using re and requests

I'm trying to extract JSON from a script tag using a regular expression, but I can't get it to match. However, my pattern works on https://regex101.com/ (using the page source to match against).
import requests
import re
req = requests.get(myURL)
matches = re.findall("/reports=\[([^]]+)\]/g", req.text)
print(matches)
The start of the json looks like this:
/*! jQuery v1.10.2 | (c) 2005, 2013 jQuery Foundation, Inc. | jquery.org/license
//# sourceMappingURL=jquery.min.map
*/
(function(e,t){var n,r,i=typeof t,o=e.location,a=e.document,s=a.documentElement,l=e.jQuery,u=e.$,c={},p=...
var reports=[
{
"Id": "ddb56456-ae7e-46da-8251-97630e1536f7",
Any pointers on what I'm doing wrong? If I write req.text to a text file then copy it into regex101.com I can match against it using the same pattern above.

You don't use the "slash" delimiters when you specify the pattern as a Python string. Plus, the "g" flag is implied by findall. Just use (as a raw string, so the backslashes reach the regex engine unchanged):
matches = re.findall(r"reports=\[([^]]+)\]", req.text)
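
If the goal is to work with the data rather than the raw text, one possible follow-up is to capture the brackets as well and hand the result to json.loads. In the sketch below, page_text is an invented stand-in for req.text (the real page is truncated in the question), and the pattern assumes the array contains no nested square brackets.

```python
import json
import re

# Invented stand-in for req.text; the real page content is not reproduced here.
page_text = (
    'var n=1;\n'
    'var reports=[\n'
    '{\n'
    '"Id": "ddb56456-ae7e-46da-8251-97630e1536f7"\n'
    '}\n'
    '];\n'
)

# Capture the brackets too, so the captured text is itself valid JSON.
match = re.search(r'reports=(\[[^\]]+\])', page_text)
reports = json.loads(match.group(1)) if match else []

print(reports[0]["Id"])
```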

Using regular expressions to find URL not containing certain info

I'm working on a scraper/web crawler using Python 3.5 and the re module where one of its functions requires retrieving a YouTube channel's URL. I'm using the following portion of code that includes the matching of regular expression to accomplish this:
href = re.compile("(/user/|/channel/)(.+)")
What it should return is something like /user/username or /channel/channelname. It does this successfully for the most part, but every now and then it grabs a type of URL that includes more information like /user/username/videos?view=60 or something else that goes on after the username/ portion.
In an attempt to address this issue, I rewrote the bit of code above as
href = re.compile("(/user/|/channel/)(?!(videos?view=60)(.+)")
along with other variations, with no success. How can I rewrite my code so that it fetches URLs that do not include videos?view=60 anywhere in the URL?

Use the following approach with a specific regex pattern:
import re
user_url = '/user/username/videos?view=60'
channel_url = '/channel/channelname/videos?view=60'
pattern = re.compile(r'(/user/|/channel/)([^/]+)')
m = re.match(pattern, user_url)
print(m.group()) # /user/username
m = re.match(pattern, channel_url)
print(m.group()) # /channel/channelname
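
The idea behind [^/]+ is to stop at the next slash instead of excluding one specific suffix. A slightly extended sketch follows; the extra ? and # in the character class and the sample hrefs list are assumptions added here for illustration, not part of the answer above.

```python
import re

# Stop at the next "/", "?" or "#" so that path segments, query strings
# and fragments after the name are all left out of the match.
channel_pattern = re.compile(r'(/user/|/channel/)([^/?#]+)')

hrefs = [
    '/user/username',
    '/user/username/videos?view=60',
    '/channel/channelname/videos?view=60',
    '/watch?v=abc',  # made-up link that is not a channel URL
]

found = []
for href in hrefs:
    m = channel_pattern.match(href)
    if m:
        found.append(m.group(0))

print(found)  # ['/user/username', '/user/username', '/channel/channelname']
```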

I used this approach, and it seems to do what you want. Note that the match includes the trailing slash.
import re
user = '/user/username/videos?view=60'
channel = '/channel/channelname/videos?view=60'
pattern = re.compile(r"(/user/|/channel/)[\w]+/")
user_match = re.search(pattern, user)
if user_match:
    print(user_match.group())  # /user/username/
else:
    print("Invalid pattern")
pattern_match = re.search(pattern, channel)
if pattern_match:
    print(pattern_match.group())  # /channel/channelname/
else:
    print("Invalid pattern")
Hope this helps!

count the number of images on a webpage, using urllib

For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try to locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?
import sys
import urllib.request
import re
img_pat = re.compile('<img.*>',re.I)
def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return (img_num)
print (get_img_cnt('http://www.americascup.com/en/schedules/races'))

Don't ever use regex for parsing HTML; use an HTML parser like lxml or BeautifulSoup. Here's a working example that gets the img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests
def get_img_cnt(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content)
    return len(soup.find_all('img'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests
def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
Also see:
Python Regex - Parsing HTML
Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.

Ahhh regular expressions.
Your regex pattern <img.*> says "Find me something that starts with <img, followed by anything, and make sure it ends with >."
Quantifiers are greedy by default, though; the pattern fills that .* with literally everything it can while leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end, </html>, and say "look! I found a > right there!"
You should get the right count by making .* non-greedy, like this:
<img.*?>
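
A quick way to see the difference is to run both patterns over a small made-up snippet (the html string below is invented for this sketch):

```python
import re

# Invented one-line HTML snippet with two images.
html = '<html><body><img src="a.png"><p>text</p><img src="b.png"></body></html>'

# Greedy: .* runs to the last ">" in the string, so everything is one match.
greedy = re.findall(r'<img.*>', html, re.I)

# Non-greedy: .*? stops at the first ">" after each "<img".
lazy = re.findall(r'<img.*?>', html, re.I)

print(len(greedy))  # 1
print(len(lazy))    # 2
```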

Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html

Python regex to match HTML

I am trying to fetch a web page via urllib2 and match parts of it with a regex.
Here is my code:
def Get(url):
    request = urllib2.Request(url)
    page = urlOpener.open(request)
    return page.read()
page = Get(myurl)
#page = "<html>.....</html>" #local string for test
pattern = re.compile(r'^\s*(<tr>$\s*<td height="25.*?</tr>)$', re.M | re.I | re.DOTALL)
for task in pattern.findall(taskListPage):
If I use a local string (the same as the result of Get(myurl)) for testing, the pattern works, but if I use Get(myurl), the pattern does not work.
I would be grateful if someone could tell me why.

Valid reservations about using regex on HTML aside, try this regex instead:
(<tr>\s*<td height="25.*?</tr>)
Your pattern required a $ (end of line) right after <tr> and again after </tr>, and the ^\s* terms at the front of the regex added a further constraint.
This match is brittle - let's hope the web guy doesn't change the height of the rows...
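
One possible explanation for "works on a local string, fails on the download" is line endings: in Python's re, $ with re.M matches only at the end of the string or just before a \n, so a \r\n after <tr> defeats it. Whether the real page uses \r\n is an assumption; the rows below are made up to show the effect.

```python
import re

# Pattern from the question and the relaxed pattern from the answer.
original = re.compile(r'^\s*(<tr>$\s*<td height="25.*?</tr>)$', re.M | re.I | re.DOTALL)
relaxed = re.compile(r'(<tr>\s*<td height="25.*?</tr>)', re.I | re.DOTALL)

# Made-up table rows; the two strings differ only in their line endings.
unix_rows = '<table>\n<tr>\n<td height="25">task</td>\n</tr>\n</table>\n'
dos_rows = unix_rows.replace('\n', '\r\n')

print(len(original.findall(unix_rows)))  # 1
print(len(original.findall(dos_rows)))   # 0
print(len(relaxed.findall(dos_rows)))    # 1
```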
