So, essentially my main issue comes from the regex part of findall. I'm trying to webscrape some information, but I can't for the life of me get any data to come out correctly. I thought that (\S+ \S+) was the regex part and that I'd be extracting whatever sits between <li> and </li> in the HTML, but instead I get an empty list from print(data). I realize that I'm going to need a \S+ for every word in each list item, so how would I go about this? Also, how would I get it to print each of the different list items it finds in the HTML?
INPUT: Just the website.
OUTPUT: In this case, it should be album titles (e.g. Mikky Ekko - Time)
import urllib.request
from re import findall
url = "http://rnbxclusive.se"
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = str(html)
data = findall("<li>(\S+ \S+)</li>.*", htmlStr)
print(data)
for item in data:
    print(item)
Use lxml:
import lxml.html

doc = lxml.html.fromstring(html)  # parse the bytes already read in the question
for li in doc.findall('.//li'):
    print(li.text_content())
<li>([^><]*)<\/li>
Try this. It will give the contents of every <li> tag. Use the g flag. See the demo:
http://regex101.com/r/dZ1vT6/55
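For example, plugged back into the question's script (a sketch, reusing the htmlStr variable from above; re.findall already returns every match, so no extra flag is needed in Python):
import re

data = re.findall(r"<li>([^><]*)</li>", htmlStr)  # one string per <li> element
for item in data:
    print(item)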
I'm working on parsing this web page.
I've got table = soup.find("div",{"class","accordions"}) to get just the fixtures list (and nothing else) however now I'm trying to loop through each match one at a time. It looks like each match starts with an article element tag <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city">
However for some reason when I try to use matches = table.findAll("article",{"role","article"})
and then print the length of matches, I get 0.
I've also tried to say matches = table.findAll("article",{"about","/fixture/arsenal"}) but get the same issue.
Is BeautifulSoup unable to parse tags, or am I just using it wrong?
Try this:
matches = table.findAll('article', attrs={'role': 'article'})
The reason is that {"role","article"} is a set, not a dictionary, so findAll doesn't interpret it as an attribute filter. Refer to the bs4 docs.
You need to pass the attributes as a dictionary. There are three ways in which you can get the data you want.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.arsenal.com/fixtures')
soup = BeautifulSoup(r.text, 'lxml')
matches = soup.find_all('article', {'role': 'article'})
print(len(matches))
# 16
Or, this is also the same:
matches = soup.find_all('article', role='article')
But both these methods give some extra article tags that don't have the Arsenal fixtures. So, if you want to find them using /fixture/arsenal, you can use CSS selectors. (Using find_all won't work, as you need a partial match.)
matches = soup.select('article[about^="/fixture/arsenal"]')
print(len(matches))
# 13
Also, have a look at the keyword arguments. It'll help you get what you want.
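For instance, keyword arguments also accept a compiled regex, which gives you the partial match on about without a CSS selector (a sketch, assuming the same soup as above):
import re

matches = soup.find_all('article', about=re.compile(r'^/fixture/arsenal'))
print(len(matches))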
This is my first time posting, so please be gentle.
I'm extracting data from TripAdvisor. Each review's rating is represented by a figure like this.
<span class="ui_bubble_rating bubble_40"></span>
As you can see, there is a "40" in the end that represents 4 stars. The same happens with "20" (2 stars) etc...
How can I obtain the "ui_bubble_rating bubble_40"?
Thank you in advance...
I'm not sure if this is the most efficient way of doing that, but here's how I'd do it:
import re

tags = soup.find_all(class_=re.compile(r"bubble_\d\d"))  # class_ with a trailing underscore, since class is a reserved word
The tags variable will then include every tag in the page that matches the regex bubble_\d\d. After that, you just need to extract the number, like so:
stars = tags[0]["class"][1].split("_")[1]  # the class attribute is a list: ['ui_bubble_rating', 'bubble_40']
If you want to be fancy, you can use list comprehensions to extract the numbers from every tag:
stars = [tag["class"][1].split("_")[1] for tag in tags]
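And if you need the rating as a number rather than a string, the trailing digits divide cleanly by ten (a small sketch, assuming the stars list from above):
ratings = [int(s) / 10 for s in stars]  # "40" -> 4.0 stars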
I am not sure what kind of data you are trying to scrape, but you can obtain that span tag like so (I tested it and left some prints in):
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("YOUR_REVIEWS_URL")
bs1 = BeautifulSoup(html, 'lxml')
for s in bs1.findAll("span", {"class": "ui_bubble_rating bubble_40"}):
    print(s)
A more generic way (scrape all ratings matching bubble_[0-9]{2}):
toFind = re.compile(r"bubble_[0-9]{2}")
for s in bs1.findAll("span", {"class": toFind}):
    print(s)
Hope that answers your question
My goal is to scrape some auction IDs off an auction site page. The page is here
For the page I am interested in, there are approximately 60 auction IDs. An auction ID is preceded by a dash, consists of 10 digits, and terminates before a .htm. For example, in the link below the ID would be 1033346952
<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">
I have got as far as extracting ALL links from the page by identifying "a" tags. That code is at the bottom of this post.
Based on my limited knowledge, I would say REGEX should be the right way to solve this. I was thinking of a REGEX something like:
-...........htm
However, I am failing to successfully integrate the regex into the code. I would have thought something like
for links in soup.find_all('-...........htm'):
would have done the trick, but obviously not.
How can I fix this code?
import bs4
import requests
import re
res = requests.get('http://www.trademe.co.nz/browse/categorylistings.aspx?mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=2&sort_order=default&rptpath=5-380-50-7145-')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for links in soup.find_all('-...........htm'):
    print(links.get('href'))
Here's code that works:
for links in soup.find_all(href=re.compile(r"auction-[0-9]{10}\.htm")):
    h = links.get('href')
    m = re.search(r"auction-([0-9]{10})\.htm", h)
    if m:
        print(m.group(1))
First you need a regex to extract the href. Then you need a capture regex to extract the id.
import re
p = re.compile(r'-(\d{10})\.htm')
print(p.search('<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">'))
res = p.search('<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">')
print(res.group(1))
-(\d{10})\.htm means that you want a dash, 10 digits, and .htm. What is more, those 10 digits are in a capturing group, so you can extract them later.
You search for this pattern, and after that you have two groups: group(0) with the whole match and group(1) with the capturing group (only the 10 digits).
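To collect all ~60 IDs on the page at once, the same compiled pattern works with findall, which returns just the capturing group for each match (a sketch, assuming res.text from the question's code):
ids = p.findall(res.text)  # e.g. ['1033346952', ...], one string per auction link
print(len(ids))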
In Python you can do:
import re
text = """<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">"""
p = re.compile(r'(?<=<a\shref=").*?(?=")')
re.findall(p,text) ## ['/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm']
You have to pass a regular expression object to find_all(); you are just handing in a string that you want to use as the pattern for a regex.
To learn and debug this kind of stuff, it is useful to cache the data from the site until things work:
import bs4
import requests
import re
import os
# don't want to download while experimenting
tmp_file = 'data.html'
if True and os.path.exists(tmp_file):  # switch True to False in production
    with open(tmp_file) as fp:
        data = fp.read()
else:
    res = requests.get('http://www.trademe.co.nz/browse/categorylistings.aspx?mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=2&sort_order=default&rptpath=5-380-50-7145-')
    res.raise_for_status()
    data = res.text
    with open(tmp_file, 'w') as fp:
        fp.write(data)
soup = bs4.BeautifulSoup(data, 'html.parser')
# and start experimenting with your regular expressions
regex = re.compile('...........htm')
for links in soup.find_all(regex):
    print(links.get('href'))
# the above doesn't find anything, you need to search the hrefs
print('try again')
for links in soup.find_all(href=regex):
    print(links.get('href'))
And once you get some matches you can improve your regex pattern using more sophisticated techniques, but that is in my experience less important than starting with the right "framework" for trying things out quickly (without waiting for a download on every code change you test).
This is simple; you don't need a regex. Let s be your string (I couldn't put the whole line here because of the wrap-around):
s = '<a href="....../auction-1033346952.htm......>'
i = s.find('auction-')
j = s[i + 8:i + 18]  # 'auction-' is 8 characters; the ID is the next 10
print(j)
The simplest way, without regexes:
>>> s='<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">'
>>> s.split('.htm')[0].split('-')[-1]
'1033346952'
I was playing around with pattern matching in the HTML of different sites and noticed something weird. I used this pattern:
pat = <div class="id-app-orig-desc">.*</div>
I used it on an app page of the Play Store (picked a random app). As I understand it, it should just give what's between the div tags (i.e. the description), but that is not what happens. It gives everything starting from the first occurrence of the pattern and going on until the last </div> of the page, completely ignoring everything in between. Does anyone know what's happening?!
And when I check the length of the returned list, it's just 1.
First of all, do not parse HTML with a regex; use a specialized tool - an HTML parser. For example, BeautifulSoup:
from bs4 import BeautifulSoup
data = """
<div>
<div class="id-app-orig-desc">
Do not try to get me with a regex, please.
</div>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
print(soup.find('div', {'class': 'id-app-orig-desc'}).text.strip())
Prints:
Do not try to get me with a regex, please.
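As for what was actually happening with the original pattern: .* is greedy, so it matches from the first <div class="id-app-orig-desc"> all the way to the last </div> on the page. If you must use a regex, a non-greedy .*? stops at the first closing tag (a sketch; it still breaks on nested <div> tags, which is why the parser above is safer):
import re

# .*? is non-greedy and stops at the first </div>; re.DOTALL lets . match newlines
m = re.search(r'<div class="id-app-orig-desc">(.*?)</div>', data, re.DOTALL)
if m:
    print(m.group(1).strip())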
I have written a script, posted below, which basically goes to a plain-text dictionary website, searches for the entered word, and retrieves the definition. The only problem is that it returns the closing paragraph tags as well; I have messed around with this for ages.
#!/usr/bin/python
import urllib2
import re
import sys
word = 'Xylophone'
page = urllib2.urlopen('http://www.mso.anu.edu.au/~ralph/OPTED/v003/wb1913_'+word[0].lower()+'.html')
html = page.read()
match = re.search(r'<P><B>'+word+'</B>.............(.*)', html)
if match:
    print match.group(1)
else:
    print 'not found'
This returns the definition with tags. What's the correct regex syntax here to ignore tags?
Prerequisite: read the famous RegEx match open tags except XHTML self-contained tags topic.
Since it is an html page you are parsing, I'd use a specific tool made for this - an HTML parser.
For example, BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup
word = 'Xylophone'
page = urllib2.urlopen('http://www.mso.anu.edu.au/~ralph/OPTED/v003/wb1913_'+word[0].lower()+'.html')
soup = BeautifulSoup(page, 'html.parser')
print(soup.find('b', text=word).parent.text)
prints:
Xylophone (n.) An instrument common among the Russians, Poles, and
Tartars, consisting of a series of strips of wood or glass graduated
in length to the musical scale, resting on belts of straw, and struck
with two small hammers. Called in Germany strohfiedel, or straw
fiddle.
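If you do want to stay with the regex approach instead, one option is to strip any leftover tags from the captured group after the fact (a sketch, assuming simple non-nested tags and the match object from the script above):
if match:
    definition = re.sub(r'<[^>]+>', '', match.group(1))  # drop remaining HTML tags
    print(definition)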