Python strip Google Alerts URL

I've currently got a dataframe filled with Google Alerts URLs that look like:
link = 'https://www.google.com/url?rct=j&sa=t&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q'
and I just want the part following url= and before the junk.
http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/
I used urllib.parse.urlparse(link) to get a list of URL elements...
parsed = ParseResult(scheme='https', netloc='www.google.com', path='/url', params='', query='rct=j&sa=t&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q', fragment='')
but even then parsed[4] only breaks it down to...
'rct=j&sa=t&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q'
I found other questions on Stack Overflow asking the same thing, but they were for languages other than Python.
Any ideas on a Python approach?

You may use a regex on parsed[4] to extract that URL:
(?:^|&)url=([^&]+)
See the regex demo
Details:
(?:^|&) - either start of string or &
url= - literal text url=
([^&]+) - Group 1 capturing one or more symbols other than &.
Python demo:
import re
p = re.compile(r'(?:^|&)url=([^&]+)')
s = "rct=j&sa=t&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q"
mObj = p.search(s)
if mObj:
    print(mObj.group(1))
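As an alternative to a regex, the standard library can split the query string itself with urllib.parse.parse_qs. The following is a minimal sketch, assuming the target always sits in a query parameter named url, as in the sample link. The helper name extract_target is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

link = ('https://www.google.com/url?rct=j&sa=t'
        '&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/'
        '&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC'
        '&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q')


def extract_target(alert_link):
    """Return the value of the 'url' query parameter, or None if it is absent.

    Hypothetical helper: the name and the None fallback are assumptions.
    """
    query = urlparse(alert_link).query      # everything after the '?'
    values = parse_qs(query).get('url')     # parse_qs maps each key to a list
    return values[0] if values else None


print(extract_target(link))
```

For a dataframe column, a function like this could probably be applied with Series.map. Note that parse_qs also decodes percent-escapes, which a plain regex does not.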

Related

Unable to get regex in python to match pattern

I'm trying to pull a number out of a copy of an HTML page that I fetched using urllib.request.
I've tried a few different regex patterns but keep getting None as the output, so I'm clearly not writing the pattern correctly, but I can't get it to work.
Below is a small part of the HTML I have in the string:
</ul>\n \n <p>* * * * *</p>\n -->\n \n <b>DistroWatch database summary</b><br/>\n <ul>\n <li>Number of all distributions in the database: 926<br/>\n <li>Number of <a href="search.php?status=Active">
I'm trying to get just the 926 out of the string. My code is below, and I can't figure out what I'm doing wrong:
import urllib.request
import re
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
#print(page.read())
print(page.read())
pageString = str(page.read())
#print(pageString)
DistroCount = re.search('^all distributions</a> in the database: ....<br/>\n$', pageString)
print(DistroCount)
Any help, pointers or resource suggestions would be much appreciated.

You can use BeautifulSoup to convert HTML to text, and then apply a simple regex to extract a number after a hardcoded string:
import urllib.request, re
from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
html = page.read()
soup = BeautifulSoup(html, 'lxml')
text = soup.get_text()
m = re.search(r'all distributions in the database:\s*(\d+)', text)
if m:
    print(m.group(1))
# => 926
Here,
soup.get_text() converts HTML to plain text and keeps it in the text variable
The all distributions in the database:\s*(\d+) regex matches all distributions in the database:, then zero or more whitespace chars and then captures into Group 1 any one or more digits (with (\d+))

I think your problem is that you are reading the whole document into a single string but use "^" at the beginning of your regex and "$" at the end, so the regex can only match if the pattern spans the entire string.
Either drop ^ and $ (and \n as well…), or process your document line by line.
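The effect of the anchors can be checked offline. The sketch below uses a shortened copy of the HTML fragment from the question as a stand-in for the fetched page, which is an assumption about its content.

```python
import re

# Shortened stand-in for the fetched page (assumed content, not the live site).
html = ('</ul>\n <b>DistroWatch database summary</b><br/>\n <ul>\n'
        ' <li>Number of all distributions in the database: 926<br/>\n')

# With ^ and $ (and no re.MULTILINE) the pattern must cover the whole string.
anchored = re.search(r'^all distributions in the database: (\d+)<br/>\n$', html)

# Without anchors the pattern may match anywhere inside the string.
unanchored = re.search(r'all distributions in the database:\s*(\d+)', html)

print(anchored)
if unanchored:
    print(unanchored.group(1))
```

The anchored search finds nothing because the string does not begin with the pattern text, while the unanchored search captures the digits.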

Using regular expressions to find URL not containing certain info

I'm working on a scraper/web crawler using Python 3.5 and the re module where one of its functions requires retrieving a YouTube channel's URL. I'm using the following portion of code that includes the matching of regular expression to accomplish this:
href = re.compile("(/user/|/channel/)(.+)")
What it should return is something like /user/username or /channel/channelname. It does this successfully for the most part, but every now and then it grabs a type of URL that includes more information like /user/username/videos?view=60 or something else that goes on after the username/ portion.
In an attempt to address this issue, I rewrote the bit of code above as
href = re.compile("(/user/|/channel/)(?!(videos?view=60)(.+)")
along with other variations, with no success. How can I rewrite my code so that it fetches URLs that do not include videos?view=60 anywhere in the URL?

Use the following approach with a specific regex pattern:
import re
user_url = '/user/username/videos?view=60'
channel_url = '/channel/channelname/videos?view=60'
# [^/]+ stops at the next slash, so anything after the name is ignored
pattern = re.compile(r'(/user/|/channel/)([^/]+)')
m = re.match(pattern, user_url)
print(m.group()) # /user/username
m = re.match(pattern, channel_url)
print(m.group()) # /channel/channelname

I used this approach, and it seems to do what you want. Note that the match includes the trailing slash.
import re
user = '/user/username/videos?view=60'
channel = '/channel/channelname/videos?view=60'
# One or more word characters after the prefix, followed by a slash
pattern = re.compile(r"(/user/|/channel/)[\w]+/")
user_match = re.search(pattern, user)
if user_match:
    print(user_match.group())  # /user/username/
else:
    print("Invalid pattern")
pattern_match = re.search(pattern, channel)
if pattern_match:
    print(pattern_match.group())  # /channel/channelname/
else:
    print("Invalid pattern")
Hope this helps!
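The two answers above can be combined into a small helper. This is only a sketch: the name channel_path is hypothetical, and stopping the name at the first /, ? or # is an assumption about how the crawled hrefs look.

```python
import re

# Prefix, then the name: everything up to the next '/', '?' or '#'.
CHANNEL_RE = re.compile(r'/(?:user|channel)/[^/?#]+')


def channel_path(href):
    """Return '/user/<name>' or '/channel/<name>', or None if href has neither."""
    m = CHANNEL_RE.search(href)
    return m.group() if m else None


print(channel_path('/user/username/videos?view=60'))
print(channel_path('/channel/channelname'))
print(channel_path('/watch?v=abc'))
```

Using search() rather than match() lets the helper accept absolute URLs as well as paths, and the character class means a missing trailing slash does not cause a failed match.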

How to match 0 or 1 time character at the end of line?

I am trying to normalize a URL, to extract the content after :// and before the last / at the end of line if it exists.
Here is my script:
url = "https://example.com/25194425/"
matchUrl = re.findall(r'://(.*)/?$', url)
print matchUrl
What I want is example.com/25194425, but I get example.com/25194425/. How to deal with the last /?
Why doesn't /? work?

An alternative way to do it without using regex is using urlparse
>>> from urlparse import urlparse
>>> url = 'https://example.com/25194425/'
>>> '{url.netloc}{url.path}'.format(url=urlparse(url)).rstrip('/')
'example.com/25194425'
Later on, if you want to include the protocol, port, params, ... parts in the normalized URL, that is easier to do this way than by updating the regex:
>>> '{url.scheme}://{url.netloc}{url.path}'.format(url=urlparse(url)).rstrip('/')
'https://example.com/25194425'
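The session above uses the Python 2 urlparse module. In Python 3 the same function lives in urllib.parse, so an equivalent sketch could look like this (the function name normalize is made up for illustration):

```python
from urllib.parse import urlparse


def normalize(url):
    """Return host plus path without the scheme or a trailing slash."""
    parts = urlparse(url)
    return (parts.netloc + parts.path).rstrip('/')


print(normalize('https://example.com/25194425/'))
print(normalize('https://example.com/25194425'))
```

Both calls give the same result, because rstrip('/') is a no-op when there is no trailing slash.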

As one of the commenters said, you just need to make the quantifier non-greedy:
://(.*?)/?$
However, the result of findall() is a list, not a string. In this case it's a list with only one entry, but it's still a list. To get the actual string, you need to provide the index:
url = "https://example.com/25194425/"
match = re.findall(r'://(.*?)/?$', url)
print(match[0])
But that seems like an inappropriate use of findall() to me. I would have gone with search():
url = "https://example.com/25194425/"
match = re.search(r'://(.*?)/?$', url)
if match:
    print(match.group(1))
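To see why /? appeared not to work, the greedy and non-greedy patterns can be compared side by side. This is a small sketch in Python 3 syntax; the variable names are only for illustration.

```python
import re

url = "https://example.com/25194425/"

# Greedy: .* swallows the trailing slash, so the optional /? matches nothing.
greedy = re.search(r'://(.*)/?$', url).group(1)

# Non-greedy: .*? stops as early as it can, leaving the slash for /?.
lazy = re.search(r'://(.*?)/?$', url).group(1)

# The non-greedy pattern also copes with a URL that has no trailing slash.
no_slash = re.search(r'://(.*?)/?$', "https://example.com/25194425").group(1)

print(greedy)
print(lazy)
print(no_slash)
```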

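By default, quantifiers are greedy: they match as many characters as possible. So (.*) consumes everything up to the end of the string, including the last slash, and /? then matches nothing.
You can use this instead:
matchUrl = re.findall(r'://(.*)/[^/]?$', url)
Note that this pattern requires a slash at or right next to the end of the string, so it does not match a URL such as https://example.com/25194425 that has no trailing slash.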

EDIT Please try the following pattern (python 2.7x):
import re
url1 = 'https://example.com/25194425?/'
url2 = 'https://example.com/25194425?'
# Raw strings keep \S from being treated as a string escape
print(re.findall(r'https?://([\S]+)(?<!/)[/]?', url1))
print(re.findall(r'https?://([\S]+)(?<!/)[/]?', url2))
Output:
['example.com/25194425?']
['example.com/25194425?']
Thanks @Alan Moore for pointing out the word boundary issue. Now it should work for both scenarios.

count the number of images on a webpage, using urllib

For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try to locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?
import sys
import urllib.request
import re
img_pat = re.compile('<img.*>',re.I)
def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return (img_num)
print (get_img_cnt('http://www.americascup.com/en/schedules/races'))

Don't ever use regex for parsing HTML; use an HTML parser, like lxml or BeautifulSoup. Here's a working example that gets the img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests
def get_img_cnt(url):
    response = requests.get(url)
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(response.content, 'html.parser')
    return len(soup.find_all('img'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests
def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
Also see:
Python Regex - Parsing HTML
Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.
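If installing third-party packages is not an option, the parser-based idea also works with the standard library's html.parser. The sketch below is not the answer's code: the class ImgCounter and the function count_images are hypothetical names, and it counts tags in a string rather than fetching a page.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Count <img> start tags while the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> is counted too.
        if tag == 'img':
            self.count += 1


def count_images(html):
    counter = ImgCounter()
    counter.feed(html)
    counter.close()
    return counter.count


sample = '<p><img src="a.png"><span>text</span><IMG src="b.png"/></p>'
print(count_images(sample))
```

To use it on a live page, the bytes returned by urlopen(url).read() would first need decoding to str, for example with .decode('utf-8', errors='replace'); the right encoding depends on the page.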

Ahhh regular expressions.
Your regex pattern <img.*> says "Find me something that starts with <img and stuff and make sure it ends with >."
Regular expressions are greedy, though; it'll fill that .* with literally everything it can while leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end, </html>, and say "look! I found a > right there!"
You should come up with the right count by making .* non-greedy, like this:
<img.*?>
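The difference in the count is easy to reproduce on a short string. The HTML below is an invented one-line sample, not the page from the question.

```python
import re

# Invented sample with two image tags on a single line.
html = '<p><img src="a.png"><span>text</span><IMG src="b.png"></p>'

# Greedy: one match running from the first <img to the last > on the line.
greedy = re.findall(r'<img.*>', html, re.I)

# Non-greedy: each match ends at the nearest >.
lazy = re.findall(r'<img.*?>', html, re.I)

print(len(greedy))
print(len(lazy))
```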

Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html

Python regex to match HTML

I am trying to request a web page via urllib2 and extract parts of it using a regex.
Here is my code
def Get(url):
    request = urllib2.Request(url)
    page = urlOpener.open(request)
    return page.read()
page = Get(myurl)
#page = "<html>.....</html>" #local string for test
pattern = re.compile(r'^\s*(<tr>$\s*<td height="25.*?</tr>)$', re.M | re.I | re.DOTALL)
for task in pattern.findall(taskListPage):
If I use a local string (the same as the result of Get(myurl)) for testing, the pattern works, but if I use Get(myurl), the pattern does not work.
I would be grateful if someone could tell me why.

Valid reservations about using regex on html aside, try this regex instead:
(<tr>\s*<td height="25.*?</tr>)
Your pattern could only match where the $ anchors were satisfied, and it had problematic terms at the front of the regex.
This match is brittle - let's hope the web guy doesn't change the height of the rows...
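One possible explanation for the difference between the local string and the fetched page, offered as an assumption because the page itself is not shown: if the server sends \r\n line endings while the local test string uses \n, the $ after <tr> can no longer match, since in multiline mode $ matches only just before a \n or at the very end of the string. The sample rows below are hypothetical.

```python
import re

original = re.compile(r'^\s*(<tr>$\s*<td height="25.*?</tr>)$',
                      re.M | re.I | re.DOTALL)
relaxed = re.compile(r'(<tr>\s*<td height="25.*?</tr>)', re.I | re.DOTALL)

# Hypothetical sample rows; the real page content is not shown in the question.
local_page = '<table>\n<tr>\n<td height="25">task</td>\n</tr>\n</table>'
fetched_page = local_page.replace('\n', '\r\n')  # assumed CRLF line endings

print(len(original.findall(local_page)))    # anchored pattern, \n endings
print(len(original.findall(fetched_page)))  # anchored pattern, \r\n endings
print(len(relaxed.findall(fetched_page)))   # pattern without anchors
```

The pattern without anchors does not care which line endings the page uses, because \s* absorbs both \r and \n.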
