REGEX to extract part of link - python

My goal is to scrape some auction IDs off an auction site page. The page is here
For the page I am interested in, there are approximately 60 auction IDs. An auction ID is preceded by a dash, consists of 10 digits, and terminates before a .htm. For example, in the link below the ID would be 1033346952:
<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">
I have got as far as extracting ALL links from the page by identifying "a" tags; that code is at the bottom of this question.
Based on my limited knowledge, I would say a regex should be the right way to solve this. I was thinking of a regex something like:
-...........htm
However, I am failing to successfully integrate the regex into the code. I would have thought something like
for links in soup.find_all('-...........htm'):
would have done the trick, but obviously not.
How can I fix this code?
import bs4
import requests
import re
res = requests.get('http://www.trademe.co.nz/browse/categorylistings.aspx?mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=2&sort_order=default&rptpath=5-380-50-7145-')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for links in soup.find_all('-...........htm'):
    print(links.get('href'))

Here's code that works:
for links in soup.find_all(href=re.compile(r"auction-[0-9]{10}\.htm")):
    h = links.get('href')
    m = re.search(r"auction-([0-9]{10})\.htm", h)
    if m:
        print(m.group(1))
First you need a regex to extract the href. Then you need a capture regex to extract the id.
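For what it's worth, the two steps can share a single compiled pattern; a sketch, reusing the soup object from the question's code:
import re
pattern = re.compile(r'auction-(\d{10})\.htm')
for link in soup.find_all('a', href=pattern):     # the regex object filters the hrefs
    print(pattern.search(link['href']).group(1))  # the capture group holds the ID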

import re
p = re.compile(r'-(\d{10})\.htm')
res = p.search('<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">')
print(res)           # the match object
print(res.group(1))  # 1033346952
-(\d{10})\.htm means that you want a dash, 10 digits, and .htm. What is more, those 10 digits are in a capturing group, so you can extract them later.
You search for this pattern, and after that you have two groups: group(0) with the whole match, and group(1) with the capturing group (only the 10 digits).
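If you want every ID on the page at once, findall with the same pattern returns just the captured groups; a sketch, assuming the page HTML is already in a string called data:
import re
p = re.compile(r'-(\d{10})\.htm')
ids = p.findall(data)  # findall returns only the capturing group for each match
print(ids)             # e.g. ['1033346952', ...]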

In Python you can do:
import re
text = """<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">"""
p = re.compile(r'(?<=<a\shref=").*?(?=")')
re.findall(p,text) ## ['/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm']
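From there, a second pattern can pull the ID out of each href found; a sketch building directly on the snippet above:
# extract the 10-digit ID from each href matched by the lookaround pattern
id_pattern = re.compile(r'-(\d{10})\.htm')
for href in re.findall(p, text):
    m = id_pattern.search(href)
    if m:
        print(m.group(1))  # 1033346952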

You have to pass a regular expression object to find_all(); as written, you are just handing in a string that you want to use as the pattern for a regex.
To learn and debug this kind of stuff, it is useful to cache the data from the site until things work:
import bs4
import requests
import re
import os
# don't want to download while experimenting
tmp_file = 'data.html'
if True and os.path.exists(tmp_file):  # switch True to False in production
    with open(tmp_file) as fp:
        data = fp.read()
else:
    res = requests.get('http://www.trademe.co.nz/browse/categorylistings.aspx?mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=2&sort_order=default&rptpath=5-380-50-7145-')
    res.raise_for_status()
    data = res.text
    with open(tmp_file, 'w') as fp:
        fp.write(data)
soup = bs4.BeautifulSoup(data, 'html.parser')
# and start experimenting with your regular expressions
regex = re.compile('...........htm')
for links in soup.find_all(regex):
    print(links.get('href'))
# the above doesn't find anything, you need to search the hrefs
print('try again')
for links in soup.find_all(href=regex):
    print(links.get('href'))
And once you get some matches you can improve on your regex pattern, using more sophisticated techniques, but that is in my experience less important than starting with the right "framework" for trying things out quickly (without waiting for a download on every code change tested).
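For example, once the cached page is in place, a tightened pattern with a capture group gets you straight to the IDs; a sketch, reusing the soup object from above:
# anchored on 'auction-', capturing exactly 10 digits before '.htm'
id_re = re.compile(r'auction-(\d{10})\.htm')
for links in soup.find_all(href=id_re):
    print(id_re.search(links['href']).group(1))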

This is simple; you don't need a regex. Let s be your string (abbreviated here because I couldn't fit the whole line).
s = '<a href="....../auction-1033346952.htm......>'
i = s.find('auction-')
j = s[i+8:i+18]   # skip the 8 characters of 'auction-', then take the next 10 digits
print j

Simplest way, without regexes:
>>> s='<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">'
>>> s.split('.htm')[0].split('-')[-1]
'1033346952'


BeautifulSoup won't parse Article element

I'm working on parsing this web page.
I've got table = soup.find("div",{"class","accordions"}) to get just the fixtures list (and nothing else). However, now I'm trying to loop through each match one at a time. It looks like each match starts with an article element tag: <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city">
However for some reason when I try to use matches = table.findAll("article",{"role","article"})
and then print the length of matches, I get 0.
I've also tried to say matches = table.findAll("article",{"about","/fixture/arsenal"}) but get the same issue.
Is BeautifulSoup unable to parse tags, or am I just using it wrong?
Try this:
matches = table.findAll('article', attrs={'role': 'article'})
The reason is that {"role","article"} is a set, not a dict, so findAll is not searching the attributes you intended. Refer to the bs4 docs.
You need to pass the attributes as a dictionary. There are three ways in which you can get the data you want.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.arsenal.com/fixtures')
soup = BeautifulSoup(r.text, 'lxml')
matches = soup.find_all('article', {'role': 'article'})
print(len(matches))
# 16
Or, this is also the same:
matches = soup.find_all('article', role='article')
But, both these methods give some extra article tags that don't contain the Arsenal fixtures. So, if you want to find them using /fixture/arsenal, you can use CSS selectors. (Using find_all with a plain string won't work, as you need a partial match.)
matches = soup.select('article[about^="/fixture/arsenal"]')
print(len(matches))
# 13
Also, have a look at the keyword arguments. It'll help you get what you want.
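That said, find_all can do a partial attribute match as well if you hand it a compiled regex instead of a plain string; a sketch, assuming the same soup as above:
import re
# a regex object as an attribute filter gives the same partial match
matches = soup.find_all('article', about=re.compile(r'^/fixture/arsenal'))
print(len(matches))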

using python urllib and beautiful soup to extract information from html site

I am trying to extract some information from this website, i.e. the line which says:
Scale(Virgo + GA + Shapley): 29 pc/arcsec = 0.029 kpc/arcsec = 1.72 kpc/arcmin = 0.10 Mpc/degree
but everything after the : is variable depending on galtype.
I have written code which uses BeautifulSoup and urllib and returns some information, but I am struggling to reduce the output further to just the line I want. How do I get just the information I want?
galname='M82'
a='http://ned.ipac.caltech.edu/cgi-bin/objsearch?objname='+galname+'&extend'+\
'=no&hconst=73&omegam=0.27&omegav=0.73&corr_z=1&out_csys=Equatorial&out_equinox=J2000.0&obj'+\
'_sort=RA+or+Longitude&of=pre_text&zv_breaker=30000.0&list_limit=5&img_stamp=YES'
print a
import urllib
f = urllib.urlopen(a)
from bs4 import BeautifulSoup
soup=BeautifulSoup(f)
import re
soup.find_all(text=re.compile('Virgo')) and soup.find_all(text=re.compile('GA')) and soup.find_all(text=re.compile('Shapley'))
Define a regular expression pattern that would help BeautifulSoup find the appropriate node, then extract the number using a capturing group:
pattern = re.compile(r"D \(Virgo \+ GA \+ Shapley\)\s+:\s+([0-9\.]+)")
print pattern.search(soup.find(text=pattern)).group(1)
Prints 5.92.
Besides, usually I'm against using regular expressions to parse HTML, but, since this is a text search and we are not going to use regular expressions to match opening or closing tags or anything related to the structure that HTML provides - you can just apply your pattern to the HTML source of the page without involving an HTML parser:
data = f.read()
pattern = re.compile(r"D \(Virgo \+ GA \+ Shapley\)\s+:\s+([0-9\.]+)")
print pattern.search(data).group(1)
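The same idea adapts to the Scale line quoted in the question; a sketch, with the pattern guessed from the quoted text (adjust it against the actual page source):
scale_pattern = re.compile(r"Scale\s*\(Virgo \+ GA \+ Shapley\)\s*:\s*(.+)")
m = scale_pattern.search(data)
if m:
    print m.group(1)  # e.g. "29 pc/arcsec = 0.029 kpc/arcsec = ..."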

Python findall regex issue

So, essentially my main issue comes from the regex part of findall. I'm trying to web-scrape some information, but I can't for the life of me get any data to come out correctly. I thought that (\S+ \S+) was the capturing part of the regex, and that I'd be extracting whatever sits between the HTML tags <li> and </li>, but instead I get an empty list from print(data). I realize that I'm going to need a \S+ for every word inside each <li> element, so how would I go about this? Also, how would I get it to print each of the matched parts separately?
INPUT: just the website URL.
OUTPUT: in this case, it should be album titles (e.g. Mikky Ekko - Time).
import urllib.request
from re import findall
url = "http://rnbxclusive.se"
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = str(html)
data = findall("<li>(\S+ \S+)</li>.*", htmlStr)
print(data)
for item in data:
    print(item)
Use lxml
import lxml.html
doc = lxml.html.fromstring(response.read())
for li in doc.findall('.//li'):
    print li.text_content()
<li>([^><]*)<\/li>
Try this. This will give the contents of every <li> tag. See the demo:
http://regex101.com/r/dZ1vT6/55
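In Python that pattern would be used roughly like this; a sketch, assuming the page HTML is in htmlStr as in the question (the forward slash needs no escaping in Python):
import re
# capture the text between <li> and </li>, excluding anything with nested tags
data = re.findall(r'<li>([^><]*)</li>', htmlStr)
for item in data:
    print(item)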

Finding urls containing a specific string

I haven't used RegEx before, and everyone seems to agree that it's bad for webscraping and HTML in particular, but I'm not really sure how to solve my little challenge without it.
I have a small Python scraper that opens 24 different webpages. In each webpage, there's links to other webpages. I want to make a simple solution that gets the links that I need and even though the webpages are somewhat similar, the links that I want are not.
The only common thing between the urls seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish - and the week number changes every week, duh). It's not like the urls have a common ID or something like that I could use to target the correct ones each time.
I figure it would be possible using RegEx to go through the webpage and find all urls that have 'uge' or 'Uge' in them and then open them. But is there a way to do that using BS? And if I do it using RegEx, what would a possible solution look like?
For example, here are two of the urls I want to grab in different webpages:
http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx
http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx
This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.
import re
for item in listofurls:
    l = re.findall("uge\d\d?", item, re.IGNORECASE)
    if l:
        print item  # just do whatever you want to do when it finds it
Yes, you can do this with BeautifulSoup.
import re
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_string)
# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile("[Uu]ge\d+"))]
# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile("(?i)uge\d+"))]
Or just use a simple for loop:
list_of_urls = ["""LIST GOES HERE"""]
for url in list_of_urls:
    if 'uge' in url.lower():
        pass  # Code to execute
The regex would look something like: uge\d\d

python url fetch help - regex

I have a web site where there are links like <a href="http://www.example.com?read.php=123">. Can anybody show me how to get all the numbers (123, in this case) in such links using Python? I don't know how to construct a regex. Thanks in advance.
import re
re.findall(r"\?read\.php=(\d+)", data)
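Here data is the page source; a sketch of where it might come from, using urllib2 in keeping with the era of the question:
import re
import urllib2
# fetch the page, then apply the pattern to the raw HTML
data = urllib2.urlopen("http://www.example.com").read()
print re.findall(r"\?read\.php=(\d+)", data)  # e.g. ['123']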
"If you have a problem, and decide to use regex, now you have two problems..."
If you are reading one particular web page and you know how it is formatted, then regex is fine - you can use S. Mark's answer. To parse a particular link, you can use Kimvai's answer. However, to get all the links from a page, you're better off using something more serious; any regex solution you come up with will have flaws.
I recommend mechanize. If you notice, the Browser class there has a links method which gets you all the links in a page. It has the added benefit of being able to download the page for you =) .
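A sketch of what that looks like (the mechanize calls here are from memory, so double-check them against its docs):
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com")
for link in br.links():  # Browser.links() yields every link on the page
    print link.url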
The following will work irrespective of how your links are formatted (e.g. if some look like <a href="foo=123"/> and some look like <A TARGET="_blank" HREF='foo=123'/>).
import re
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
p = re.compile('^.*=([\d]*)$')
for a in soup.findAll('a'):
    m = p.match(a["href"])
    if m:
        print m.groups()[0]
While the other answers are sort of correct, you should probably use the urllib2 library instead;
from urllib2 import urlparse
import re
urlre = re.compile('<a[^>]+href="([^"]+)"[^>]*>',re.IGNORECASE)
links = urlre.findall('<a href="http://www.example.com?read.php=123">')
for link in links:
    url = urlparse.urlparse(link)
    s = [x.split("=") for x in url[4].split(';')]
    d = {}
    for k, v in s:
        d[k] = v
    print d["read.php"]
It's not as simple as some of the above, but guaranteed to work even with more complex urls.
/[0-9]/
that's the regex syntax you want. For reference see
http://gnosis.cx/publish/programming/regular_expressions.html
One without the need for regex
>>> s='<a href="http://www.example.com?read.php=123">'
>>> for item in s.split(">"):
...     if "href" in item:
...         print item[item.index("a href")+len("a href="): ]
...
"http://www.example.com?read.php=123"
If you want to extract just the number:
item[item.index("a href")+len("a href="): ].split("=")[-1].strip('"')
