I'm writing a web scraper with requests and BeautifulSoup, and there's an element in the DOM that I can't find.
Here's what I do:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.decitre.fr/rechercher/result/?q=victor+hugo&search-scope=3')
soup = BeautifulSoup(r.text)
The element I can't find is the "old-price" one (the struck-through price), which I can see when I inspect the DOM with the browser's dev tools.
soup.find_all(class_='old-price') # returns [], no matter if I specify "span"
Moreover I can't see the 'old-price' string in the soup or the result of the request:
'old-price' in soup.text # False
'old-price' in r.text # False
I can't see it when I fetch the source with wget either.
I can get its parent div, but I can't find the price children inside it:
commands = soup.find_all(class_='product_commande')
commands[0].find_all('old-price') # []
So I have no idea what's going on. What am I missing?
Am I misusing requests/BeautifulSoup? (I'm not sure whether r.text returns the full HTML.)
Is that part of the HTML generated by JavaScript? If so, how can I tell, and is there a way to get the complete HTML?
Many thanks.
In my case I was passing invalid HTML into Beautiful Soup which was causing it to ignore everything after the invalid tag at the start of the document:
<!--?xml version="1.0" encoding="iso-8859-1"?-->
Note that I am also using Ghost.py. Here is how I removed the tag.
# Remove the invalid XML tag
ghostContent = ghost.content
invalidCode = '<!--?xml version="1.0" encoding="iso-8859-1"?-->'
if ghostContent.startswith(invalidCode):
    ghostContent = ghostContent[len(invalidCode):]
doc = BeautifulSoup(ghostContent)
# Test to see if we can find text
if 'Application Search Results' in doc.text:
    print('YES!')
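If the declaration varies (different encoding, leading whitespace), a regular expression may be more forgiving than an exact prefix match. This is only a sketch under that assumption; the function name and the sample page are made up for illustration and are not part of Ghost.py or Beautiful Soup.

```python
import re

# Matches a leading XML declaration that was turned into an HTML comment,
# e.g. <!--?xml version="1.0" encoding="iso-8859-1"?-->
XML_DECL_COMMENT = re.compile(r'^\s*<!--\?xml.*?\?-->\s*', re.DOTALL)

def strip_xml_declaration_comment(html):
    """Return html without a leading commented-out XML declaration (hypothetical helper)."""
    return XML_DECL_COMMENT.sub('', html, count=1)

# Hypothetical page content standing in for ghost.content
page = '<!--?xml version="1.0" encoding="iso-8859-1"?--><html><body>ok</body></html>'
cleaned = strip_xml_declaration_comment(page)
print(cleaned)
```

The cleaned string can then be passed to BeautifulSoup as in the snippet above.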
I am trying to scrape an h2 tag, but BeautifulSoup returns an empty list.
<h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job">
from urllib.request import urlopen
from bs4 import BeautifulSoup
html=urlopen("https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job")
bs0bj=BeautifulSoup(html,"lxml")
nameList=bs0bj.findAll("h2",{"class":"iCIMS_InfoMsg iCIMS_InfoField_Job"})
print(nameList)
The content is inside an iframe and is populated via JavaScript, so it is not present in the response to the initial request. You can request the same URL the page uses for the iframe content (the iframe's src). Then extract the string from the script tag that holds the info, load it with json, take the description (which is HTML), and pass it back to BeautifulSoup to select the h2 tags. The rest of the info is then available in the second soup object as well, if required.
import requests
from bs4 import BeautifulSoup as bs
import json
r = requests.get('https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job?mobile=false&width=1140&height=500&bga=true&needsRedirect=false&jan1offset=0&jun1offset=60&in_iframe=1')
soup = bs(r.content, 'lxml')
script = soup.select_one('[type="application/ld+json"]').text
data = json.loads(script)
soup = bs(data['description'], 'lxml')
headers = [item.text for item in soup.select('h2')]
print(headers)
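To try the same idea without hitting the network, here is a small offline sketch. The sample HTML, the function name, and the use of the built-in "html.parser" instead of lxml are my assumptions, not what the site actually returns.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the iframe response
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"title": "Example job", "description": "<h2>Overview</h2><p>Text</p><h2>Requirements</h2>"}
</script>
</head><body></body></html>
"""

def headings_from_json_ld(html):
    """Return the h2 texts found in the JSON-LD 'description' field (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.select_one('script[type="application/ld+json"]')
    if script is None or not script.string:
        return []  # no embedded JSON-LD block on this page
    data = json.loads(script.string)
    # The description is itself HTML, so it gets parsed a second time
    inner = BeautifulSoup(data.get("description", ""), "html.parser")
    return [h2.get_text(strip=True) for h2 in inner.select("h2")]

print(headings_from_json_ld(SAMPLE_HTML))
```

Returning an empty list when the script tag is missing avoids the AttributeError you would otherwise get from calling .text on None.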
The answer lies hidden in two elements:
JavaScript-rendered content: it is inserted after document.onload.
In particular, the content managed by JS comes after an HTML comment and is indeed rendered by JS. The line where the block starts is: "< ! - -BEGIN ICIMS - - >" (spaces added so that it does not render as blank).
As you can imagine, the h2 class="iCIMS class here" DOESN'T exist WHEN you call the bs4 methods.
The solution?
IMHO the best way to achieve what you want is to use Selenium to get a fully rendered web page.
See also:
Web-scraping JavaScript page with Python
I'm fairly new to scraping with Python.
I am trying to obtain the number of search results from a query on Exalead. In this example I would like to get "586,564 results".
This is the code I am running:
r = requests.get(URL, headers=headers)
tree = html.fromstring(r.text)
stats = tree.xpath('//*[@id="searchform"]/div/div/small/text()')
This returns an empty list.
I copy-pasted the XPath directly from the browser's Elements panel.
As an alternative, I have tried using Beautiful Soup:
html = r.text
soup = BeautifulSoup(html, 'xml')
stats = soup.find('small', {'class': 'pull-right'}).text
which raises an AttributeError: 'NoneType' object has no attribute 'text'.
When I checked the html source I realised I actually cannot find the element I am looking for (the number of results) on the source.
Does anyone know why this is happening and how this can be resolved?
Thanks a lot!
When I checked the html source I realised I actually cannot find the element I am looking for (the number of results) on the source.
This suggests that the data you're looking for is generated dynamically with JavaScript. requests only retrieves the HTML source, so the element has to be visible there for you to parse it.
To confirm that this is the cause of your error, you could try something really simple like:
html = r.text
soup = BeautifulSoup(html, 'lxml')
*Note the 'lxml' parser above (your code used 'xml').
Then manually check soup to see whether your desired element is there.
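The manual check can also be done in code by looking for a few marker strings in the raw response text. A minimal sketch; the helper name and the sample HTML are invented for illustration:

```python
def missing_markers(raw_html, markers):
    """Return the markers that do not occur in the HTML as the server sent it."""
    return [marker for marker in markers if marker not in raw_html]

# Hypothetical response body: the page ships an empty container that a script fills in later
sample_html = '<form id="searchform"><div id="stats"></div></form><script src="app.js"></script>'

print(missing_markers(sample_html, ['searchform', 'pull-right']))  # ['pull-right']
```

With a real request you would pass r.text as raw_html. Any marker reported as missing is not in the static source, so no parser will find it there.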
I can get that with the CSS selector small.pull-right, which combines the tag name and the element's class name.
from bs4 import BeautifulSoup
import requests
url = 'https://www.exalead.com/search/web/results/?q=lead+poisoning'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
print(soup.select_one('small.pull-right').text)
I need to download a few links from an HTML page. I don't need all of them, only those in a certain section of the page.
For example, in http://www.nytimes.com/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning, I need the links in the debaters section. I plan to use BeautifulSoup, and I looked at the HTML of one of the links:
Data Collection Is Out of Control
Here's my code:
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
link_set = set()
for link in soup.find_all("a", class = "bl-bigger"):
    href = link.get('href')
    if href == None:
        continue
    elif '/roomfordebate/' in href:
        link_set.add(href)
for link in link_set:
    print link
This code is supposed to give me all the links with the bl-bigger class, but it actually returns nothing. Could anyone figure out what's wrong with my code or how to make it work?
Thanks
I don't see the bl-bigger class at all when I view the source in Chrome. Maybe that's why your code is not working?
Let's start by looking at the source. The whole Debaters section seems to be inside a div with class nytint-discussion-content. So, using BeautifulSoup, let's get that whole div first.
debaters_div = soup.find('div', class_="nytint-discussion-content")
Again from the source, it seems all the links are within list items (li tags). Now all you have to do is find all the li tags and then the anchor tags within them. One more thing you can notice is that all the li tags have the class nytint-bylines-1.
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
list_items[0].find('a')
# Data Collection Is Out of Control
So, your whole code can be:
link_set = set()
response = requests.get(url)
html_data = response.text
soup = BeautifulSoup(html_data)
debaters_div = soup.find('div', class_="nytint-discussion-content")
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
for each_item in list_items:
    html_link = each_item.find('a').get('href')
    if html_link.startswith('/roomfordebate'):
        link_set.add(html_link)
Now link_set will contain all the links you want. For the link given in the question, it will fetch 5 links.
PS: link_set contains only relative paths and not full URLs. So I would prepend http://www.nytimes.com before adding those links to link_set. Just change the last line to:
link_set.add('http://www.nytimes.com' + html_link)
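If some hrefs might already be absolute, plain string concatenation would produce broken URLs. urllib.parse.urljoin from the standard library handles both cases. A minimal sketch; the helper name and example paths are hypothetical:

```python
from urllib.parse import urljoin

BASE_URL = "http://www.nytimes.com"

def absolutize(hrefs, base=BASE_URL):
    """Return a set of absolute URLs; hrefs that are already absolute are kept as they are."""
    return {urljoin(base, href) for href in hrefs}

# Hypothetical hrefs: one relative, one already absolute
links = absolutize([
    "/roomfordebate/2014/09/24/example",
    "http://www.nytimes.com/roomfordebate/other",
])
for link in sorted(links):
    print(link)
```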
class is a reserved word in Python, so it can't be passed as a keyword argument. You need to call the method with a dictionary instead:
soup.find("tagName", { "class" : "cssClass" })
or use the .select method, which executes CSS queries:
soup.select('a.bl-bigger')
Examples are in the docs; just search for the '.select' string. Also, instead of writing the entire script up front, you'll get to working code more quickly in the IPython interactive shell.
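For comparison, here is a small sketch of three ways to match a CSS class in bs4, run against a made-up HTML snippet. Besides the dictionary and .select forms, bs4 also accepts the class_ keyword (with a trailing underscore), which sidesteps the reserved word. The sample markup is an assumption, not the NYT page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with the class, one without
SAMPLE = '<a class="bl-bigger" href="/roomfordebate/a">A</a><a href="/other">B</a>'
soup = BeautifulSoup(SAMPLE, "html.parser")

by_keyword = soup.find_all("a", class_="bl-bigger")   # trailing underscore avoids the reserved word
by_dict = soup.find_all("a", {"class": "bl-bigger"})  # attribute dictionary
by_css = soup.select("a.bl-bigger")                   # CSS selector

hrefs = [[a["href"] for a in found] for found in (by_keyword, by_dict, by_css)]
print(hrefs)
```

All three find the same single link here. Of course none of them helps if the class is not in the fetched source in the first place.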
I need to check a web page's search results and compare them to user input.
ui = raw_input() #for example "Niels Bohr"
link = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=90&k=10"
stranica=urllib.urlopen(link)
soup = BeautifulSoup(stranica, from_encoding="utf-8")
beauty = soup.prettify()
print beauty
Since there are 1502 results, my idea was to change k=10 to k=1502. Now I need some kind of function to check whether the search results contain my user input. I know that my names are the text after TEXT
So how do I do it? Maybe using a regex?
The second part: if there are matching results, I need to get the link of each result. Again, I know that the link is inside that href="", but how do I get it out and make it usable?
Finding if Niels Bohr is listed is as easy as using a large batch number and loading the resulting page:
import sys
import urllib2
from bs4 import BeautifulSoup
url = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=0&k={}".format(sys.maxint)
name = u'Bohr, Niels'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
for link in soup.find_all(class_='AllWordsTextHit', text=name):
    print link
This prints every link whose link text is exactly 'Bohr, Niels'. You can use a regular expression if you need a partial match.
The link object has a (relative) href attribute you can then use to load the next page:
professor_page = 'http://www.enciklopedija.hr/' + link['href']
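As a rough sketch of the partial-match variant, a compiled regular expression can be passed in place of the exact name. The sample markup and href values below are invented; only the AllWordsTextHit class comes from the code above. Recent bs4 versions call the argument string (text is the older name for it).

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical search-result markup; the href values are made up
SAMPLE = (
    '<a class="AllWordsTextHit" href="Natuknica.aspx?ID=1">Bohr, Niels</a>'
    '<a class="AllWordsTextHit" href="Natuknica.aspx?ID=2">Bohr, Aage</a>'
    '<a class="AllWordsTextHit" href="Natuknica.aspx?ID=3">Curie, Marie</a>'
)
soup = BeautifulSoup(SAMPLE, "html.parser")

# A compiled pattern matches any link whose text contains "Bohr"
matches = soup.find_all(class_="AllWordsTextHit", string=re.compile("Bohr"))
pages = [urljoin("http://www.enciklopedija.hr/", a["href"]) for a in matches]
print(pages)
```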
I am using python and beautifulsoup for html parsing.
I am using the following code :
from BeautifulSoup import BeautifulSoup
import urllib2
import re
url = "http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query"
main_url = urllib2.urlopen(url)
content = main_url.read()
soup = BeautifulSoup(content)
for a in soup.findAll('a',href=True):
    print a[href]
but I am not getting output links like:
http://www.wikipathways.org/index.php/Pathway:WP26
Another important thing: there are 107 pathways, but I will not get all the links, since the other links depend on "show links" at the bottom of the page.
So, how can I get all the links (107 links) from that URL?
Your problem is line 8, content = url.read(). url is a string, so you're not actually reading the web page (if anything, you should be getting an error).
main_url is what you want to read, so change line 8 to:
content = main_url.read()
You also have another error: print a[href]. href should be a string, so it should be:
print a['href']
I would suggest using lxml; it's faster and better for parsing HTML, and worth investing the time to learn.
from lxml.html import parse
dom = parse('http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query').getroot()
links = dom.cssselect('a')
That should get you going.
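Whichever parser you pick, the next step is to pull out the href values and keep only the pathway links. Here is a dependency-free sketch of that step using the standard library's html.parser rather than lxml or BeautifulSoup. The class name, function name, and sample markup are hypothetical, and it only sees links present in the fetched HTML, not ones revealed later by "show links".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors without an href
            self.links.append(urljoin(self.base_url, href))

def pathway_links(html, base_url="http://www.wikipathways.org/"):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [link for link in collector.links if "/Pathway:" in link]

# Hypothetical fragment of a results page
SAMPLE = '<a href="/index.php/Pathway:WP26">WP26</a><a href="/index.php/Help">Help</a><a name="top">x</a>'
print(pathway_links(SAMPLE))
```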