I am trying to crawl a website (with Python) and get its users' info. But when I download the source of the pages, it is different from what I see in "Inspect element" in Chrome. I googled and it seems I should use Selenium, but I don't know how to use it. This is the code I have, and when I look at driver.page_source it still matches the plain page source in Chrome and doesn't look like the source shown in "Inspect element".
I would really appreciate it if someone could help me fix this.
import os
from selenium import webdriver
chromedriver = "/Users/adam/Downloads/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
driver.get("http://www.tudiabetes.org/forum/users/Bug74/activity")
driver.quit()
It's called XHR.
Your page is populated by another call: your URL only loads the structure of the page, and the meat of the page comes from a separate request made via XHR that returns a JSON-formatted string, not from the page load itself.
You should really consider using requests and bs4 to query that endpoint directly instead.
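Once you find the XHR request in the DevTools network tab, you can fetch its JSON with requests and skip the browser entirely. The endpoint and field names below are illustrative assumptions, not the site's real API; read the actual URL and keys from the network tab:

```python
import json

# Illustrative JSON payload of the kind such an XHR endpoint might return.
# In practice you would do something like: data = requests.get(xhr_url).json()
sample = '''
{
  "user_actions": [
    {"username": "Bug74", "action_type": 4, "created_at": "2015-01-01"},
    {"username": "Bug74", "action_type": 5, "created_at": "2015-01-02"}
  ]
}
'''

data = json.loads(sample)
# Walk the parsed structure directly instead of scraping rendered HTML
for action in data["user_actions"]:
    print(action["username"], action["created_at"])
```

The advantage over Selenium is that you get structured data with no browser and no waiting for JavaScript to run.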
Related
Unfortunately, I am an absolute beginner in the field of web scraping, but I would like to get into it seriously in the near future. I want to use a Python script to save the data of a table to an Excel file, which in itself is not a problem. However, the source code of the website does not contain any of the values that I would like to have. When inspecting the page, the values do appear in the HTML structure, but when I query them by XPath, the lookup reports that nothing can be found. If I use the Chrome add-on "DataMiner", it can read out the values. How can I achieve this myself in Python? The picture shows the data I want to scrape; unfortunately, this data is not included in the source code.
from selenium import webdriver
import time

url = 'https://herakles.webuntis.com/WebUntis/monitor?school=Europaschule%20Gym%20Rhauderfehn&monitorType=subst&format=Test%20Sch%C3%BCler'

browser = webdriver.Chrome()
browser.get(url)
time.sleep(5)  # give the JavaScript time to render the table
htmlSource = browser.page_source
print(htmlSource)
Update: The script now prints out the source code, but when searching for an element by the XPath, it still doesn't show anything. As I already said, I'm completely new to Python and web-scraping.
Here's a version with requests only. You can obtain the payload data from your DevTools network tab:
import requests

get_url = "https://herakles.webuntis.com/WebUntis/monitor?school=Europaschule%20Gym%20Rhauderfehn&monitorType=subst&format=Test%20Sch%C3%BCler"
post_url = "https://herakles.webuntis.com/WebUntis/monitor/substitution/data?school=Europaschule Gym Rhauderfehn"
payload = {"formatName":"Test Schüler","schoolName":"Europaschule Gym Rhauderfehn","date":20211204,"dateOffset":0,"strikethrough":True,"mergeBlocks":True,"showOnlyFutureSub":True,"showBreakSupervisions":False,"showTeacher":True,"showClass":True,"showHour":True,"showInfo":True,"showRoom":True,"showSubject":True,"groupBy":1,"hideAbsent":True,"departmentIds":[],"departmentElementType":-1,"hideCancelWithSubstitution":True,"hideCancelCausedByEvent":False,"showTime":False,"showSubstText":True,"showAbsentElements":[],"showAffectedElements":[1],"showUnitTime":True,"showMessages":True,"showStudentgroup":False,"enableSubstitutionFrom":True,"showSubstitutionFrom":1600,"showTeacherOnEvent":False,"showAbsentTeacher":True,"strikethroughAbsentTeacher":True,"activityTypeIds":[2,3],"showEvent":True,"showCancel":True,"showOnlyCancel":False,"showSubstTypeColor":False,"showExamSupervision":False,"showUnheraldedExams":False}

with requests.Session() as s:
    r = s.get(get_url)  # establish the session cookies first
    s.headers['Content-Type'] = "application/json;charset=UTF-8"
    r = s.post(post_url, json=payload)
    print(r.json())
I've tried to get the world population from this website: https://www.worldometers.info/world-population/
but I can only get the html code, not the data of the actual numbers.
I already tried to find children of the object I tried to get data from. I also tried to list the whole object, but nothing seemed to work.
# just importing stuff
import requests
from bs4 import BeautifulSoup

# getting html from website to text
r = requests.get('https://www.worldometers.info/world-population/')
soup = BeautifulSoup(r.text, 'html.parser')

# here it only finds the one object that is listed below
current_population = soup.find('div', {'class': 'maincounter-number'}).find_all('span', recursive=False)
print(current_population)
This is the object the information is stored in:
<span class="rts-counter" rel="current_population">retrieving data... </span>
and in 'inspect-mode' you can see this:
<span class="rts-counter" rel="current_population"><span class="rts-nr-sign"></span><span class="rts-nr-int rts-nr-10e9">7</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e6">703</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e3">227</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e0">630</span></span>
I always only get the first one, but want to get the second one from 'inspect-mode'.
Here is a picture of the inspect-mode.
You are going to need a method that lets JavaScript run, such as Selenium, because this number is set by a counter generated in this script: https://www.realtimestatistics.net/rts/RTSp.js
from selenium import webdriver
from selenium.webdriver.common.by import By

d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
print(d.find_element(By.CSS_SELECTOR, '[rel="current_population"]').text)
You could try writing your own version of that javascript script but I wouldn't recommend it.
I didn't need an explicit wait condition for selenium script but that could be added.
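Once you have the rendered HTML (for example from driver.page_source), pulling the number out of the counter markup shown in the question is plain string parsing. A minimal stdlib sketch, using that markup as sample input:

```python
import re

# Rendered counter markup, as seen in inspect mode (sample input)
html = ('<span class="rts-counter" rel="current_population">'
        '<span class="rts-nr-int rts-nr-10e9">7</span><span class="rts-nr-thsep">,</span>'
        '<span class="rts-nr-int rts-nr-10e6">703</span><span class="rts-nr-thsep">,</span>'
        '<span class="rts-nr-int rts-nr-10e3">227</span><span class="rts-nr-thsep">,</span>'
        '<span class="rts-nr-int rts-nr-10e0">630</span></span>')

# Strip the tags, keep the text, then drop the thousands separators
text = re.sub(r'<[^>]+>', '', html)   # "7,703,227,630"
population = int(text.replace(',', ''))
print(population)  # 7703227630
```

This is the same thing `.text` does for you in Selenium, which is why selecting `[rel="current_population"]` and reading its text is usually enough.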
The website you are scraping is a JavaScript web app. The element content you see in inspect mode is the result of running some JavaScript code after the page downloads that populates that element. Prior to the JavaScript running, the element only contains the text "retrieving data...", which is what you see in your Python code. Neither the Python requests library nor BeautifulSoup run JavaScript in downloaded HTML -- they only download and parse the HTML, and that is why your code only sees the initial text.
You have two options:
Inspect the JavaScript code or website calls and figure out what HTTP URL the page is calling to retrieve the value it puts into that element. Have your Python code fetch that URL instead and parse the value from the response for that URL.
Use a full browser engine. This StackOverflow answer provides a solution: Web-scraping JavaScript page with Python
The JavaScript is executed in the browser to build the DOM, so Beautiful Soup alone will not work as you want it to.
You will have to use something that lets JavaScript run (i.e. a browser engine); you could even build your own minimal browser using Qt 4 or the like. Sentdex had a good tutorial on it here:
https://www.youtube.com/watch?v=FSH77vnOGqU
Otherwise, you could use Selenium:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get('https://www.worldometers.info/world-population/')
time.sleep(5)  # give the counter script time to run
html = driver.page_source
print(html)
Let's use the url https://www.google.cl/#q=stackoverflow as an example. Using Chrome Developer Tools on the first link given by the search we see this html code:
Now, if I run this code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen("https://www.google.cl/#q=stackoverflow")
soup = BeautifulSoup(response, "html.parser")
print(soup.prettify())
I won't find the same elements. In fact, I won't find any link from the results given by the Google search. The same goes if I use the requests module. Why does this happen? Can I do something to get the same results as if I were requesting from a web browser?
Since the html is generated dynamically, likely from a modern single page javascript framework like Angular or React (or even just plain JavaScript), you will need to actually drive a browser to the site using selenium or phantomjs before parsing the dom.
Here is some skeleton code.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://google.com")
html = driver.execute_script("return document.documentElement.innerHTML")
soup = BeautifulSoup(html, "html.parser")
Here is the selenium documentation for more info on running selenium, configurations, etc.:
http://selenium-python.readthedocs.io/
edit:
You will likely need to add a wait before grabbing the html, since it may take a second or so to load certain elements of the page. See below for a reference to the explicit wait documentation for Python Selenium:
http://selenium-python.readthedocs.io/waits.html
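An explicit wait is essentially a poll-until-true loop with a timeout. This is a simplified conceptual sketch, not Selenium's actual implementation, that shows what WebDriverWait is doing for you:

```python
import time

def wait_for(condition, timeout=10, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        value = condition()
        if value:
            return value
        time.sleep(poll)
    raise TimeoutError("condition not met within %s seconds" % timeout)

# With Selenium you would pass a condition that queries the driver, e.g.:
#   wait_for(lambda: driver.find_elements(By.CSS_SELECTOR, "table tr"))
# or use the real API directly:
#   WebDriverWait(driver, 10).until(
#       EC.presence_of_element_located((By.CSS_SELECTOR, "table tr")))

# Stand-in condition so the sketch runs without a browser:
state = {"loaded": False}
def fake_page_loaded():
    state["loaded"] = True  # pretend the element just appeared
    return state["loaded"]

print(wait_for(fake_page_loaded, timeout=2))  # True
```

Prefer the real WebDriverWait in production code; it also ignores transient exceptions like StaleElementReferenceException while polling.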
Another source of complication is that certain parts of the page might be hidden until AFTER user interaction. In this case you will need to code your selenium script to interact with the page in certain ways before grabbing the html.
I'm fairly new to Python so excuse me if the problem isn't clear or if the answer is obvious.
I want to scrape the web page http://jassa.fr/. I generated some random input (sequences) and want to see how it holds up against my own data. I tried scraping the page using Selenium, but the HTML of the page doesn't use any IDs, and I don't know how to navigate the DOM without them (is that impossible with Selenium?).
Does anyone have any ideas for me how to tackle this problem, especially regarding that I want to scrape the results which are generated server side?
Thanks in advance!
[edit]
Thanks for the quick response!
How do I access this text area using selenium:
<textarea style="border:1px solid #999999;" tabindex="1" name="sequence" cols="70" rows="4" onfocus="if(this.value=='Type or paste your sequence')this.value='';">Type or paste your sequence</textarea>
Edit: After clarification that you need to access the <textarea> with the name sequence, I suggest locating it by its name attribute; see here for more details on selecting elements in Selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "http://jassa.fr/"
browser = webdriver.Firefox()
browser.get(url)

form = browser.find_element(By.TAG_NAME, "form")
sequence = form.find_element(By.NAME, "sequence")
sequence.clear()
sequence.send_keys("ATTTAATTTA")
form.submit()
Selenium can navigate the tree and select elements not only by ID but also by class, tag name, link text and so on (see the docs). However, I find myself more comfortable with the following scenario: I use Selenium to grab the page content (so the browser renders the JavaScript parts), then feed it to BeautifulSoup and navigate it with BeautifulSoup methods. It looks like this:
from bs4 import BeautifulSoup
from selenium import webdriver

url = "http://example.com/"
browser = webdriver.Firefox()
browser.get(url)

page = BeautifulSoup(browser.page_source, "lxml")
# Let's find some tables and then print all their rows
for table in page("table"):
    for row in table("tr"):
        print(row)
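The same BeautifulSoup navigation works on any HTML string, so you can try the table loop without a browser. A self-contained example with made-up inline HTML (the sequence values are illustrative, not real jassa.fr output):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>ATTTAATTTA</td><td>match</td></tr>
  <tr><td>GGCCGGCC</td><td>no match</td></tr>
</table>
"""

page = BeautifulSoup(html, "html.parser")
# page("table") is shorthand for page.find_all("table")
for table in page("table"):
    for row in table("tr"):
        print(row.get_text(" ", strip=True))
```

This prints one line per row ("ATTTAATTTA match", then "GGCCGGCC no match"), which is usually the form you want before writing rows out to a file.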
However, I'm not sure that you really need Selenium. The site you are going to parse doesn't seem to rely on JavaScript heavily, so it may be easier to use a simpler solution like RoboBrowser or MechanicalSoup (or mechanize for Python 2).
I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html

page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
at this point, html_element should be a list of elements (I think in this case only 1), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it's only grabbing the first part. So my questions are
1: Am I correct in my assessment of the problem?
and
2: If so, is there a way to make requests.get() wait before returning the html, or perhaps another route entirely to get the whole page.
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.
You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And the response's .text always contains the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
There are a number of general solutions to that. For example:
Use selenium or similar to drive an actual browser to download the page.
Manually work out what the JavaScript code does and do equivalent work in Python.
Run a headless JavaScript interpreter against a DOM that you've built up.
The page uses JavaScript to load the table, which is not present when requests fetches the HTML, so you are getting all the HTML, just not the parts generated by JavaScript. You could use Selenium combined with PhantomJS for headless browsing to get the rendered HTML:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source
print(html)