How can I scrape data from this table with Python?

Unfortunately, I am an absolute beginner in the field of web scraping, but I would like to get into it intensively in the near future. I want to save the data of a table to an Excel file with a Python script, which by itself is not a problem. However, the source code of the website does not contain any of the values I would like to have. When inspecting the page, the values do appear in the HTML structure, but when I query them via XPath, the output only tells me that this is not permitted. If I use the Chrome add-on "DataMiner", it can read out the values. How can I achieve this myself in Python? The picture shows the data I want to scrape; unfortunately, this data is not included in the source code.
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import requests

url = 'https://herakles.webuntis.com/WebUntis/monitor?school=Europaschule%20Gym%20Rhauderfehn&monitorType=subst&format=Test%20Sch%C3%BCler'

# Open the page in a real browser so its JavaScript can run,
# then give it a few seconds to render before reading the HTML.
browser = webdriver.Chrome()
browser.get(url)
time.sleep(5)
htmlSource = browser.page_source
print(htmlSource)
Update: The script now prints out the page source, but when I search for an element by XPath, it still doesn't find anything. As I already said, I'm completely new to Python and web scraping.
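For reference, once the page has rendered, locating elements by XPath from the same script might look like the sketch below; the //table//tr expression is only a guess at the structure and should be adjusted to what devtools shows in the Elements tab:
from selenium.webdriver.common.by import By

# "//table//tr" is an assumed XPath -- replace it with the path you see
# in the rendered HTML (Elements tab), not in "view source".
for row in browser.find_elements(By.XPATH, "//table//tr"):
    print(row.text)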

Here's a version with requests only. You can obtain the payload data from your devtools network tab.
import requests

get_url = "https://herakles.webuntis.com/WebUntis/monitor?school=Europaschule%20Gym%20Rhauderfehn&monitorType=subst&format=Test%20Sch%C3%BCler"
post_url = "https://herakles.webuntis.com/WebUntis/monitor/substitution/data?school=Europaschule Gym Rhauderfehn"

# Payload copied from the POST request visible in the devtools network tab
payload={"formatName":"Test Schüler","schoolName":"Europaschule Gym Rhauderfehn","date":20211204,"dateOffset":0,"strikethrough":True,"mergeBlocks":True,"showOnlyFutureSub":True,"showBreakSupervisions":False,"showTeacher":True,"showClass":True,"showHour":True,"showInfo":True,"showRoom":True,"showSubject":True,"groupBy":1,"hideAbsent":True,"departmentIds":[],"departmentElementType":-1,"hideCancelWithSubstitution":True,"hideCancelCausedByEvent":False,"showTime":False,"showSubstText":True,"showAbsentElements":[],"showAffectedElements":[1],"showUnitTime":True,"showMessages":True,"showStudentgroup":False,"enableSubstitutionFrom":True,"showSubstitutionFrom":1600,"showTeacherOnEvent":False,"showAbsentTeacher":True,"strikethroughAbsentTeacher":True,"activityTypeIds":[2,3],"showEvent":True,"showCancel":True,"showOnlyCancel":False,"showSubstTypeColor":False,"showExamSupervision":False,"showUnheraldedExams":False}

with requests.session() as s:
    # The initial GET picks up the session cookies the data endpoint expects
    r = s.get(get_url)
    s.headers['Content-Type'] = "application/json;charset=UTF-8"
    r = s.post(post_url, json=payload)
    print(r.json())
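If the goal is still an Excel file, a minimal follow-up sketch with pandas could look like the following; the "payload"/"rows" keys are an assumption about the JSON structure, so print r.json() first and adjust accordingly:
import pandas as pd

data = r.json()
rows = data.get("payload", {}).get("rows", [])  # key names are an assumption, check the real response
df = pd.json_normalize(rows)                    # flatten the nested JSON into columns
df.to_excel("substitutions.xlsx", index=False)  # needs openpyxl installed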

Related

Is there a way to get information about elements from the inspect menu in a website?

I've tried to get the world population from this website: https://www.worldometers.info/world-population/, but I can only get the HTML code, not the actual numbers.
I already tried to find the children of the object I'm trying to get data from. I also tried listing the whole object, but nothing seemed to work.
# just importing stuff
import urllib.request
import requests
from bs4 import BeautifulSoup

# getting html from website to text
r = requests.get('https://www.worldometers.info/world-population/')
soup = BeautifulSoup(r.text, 'html.parser')

# here it only finds the one object that is listed below
current_population = soup.find('div', {'class': 'maincounter-number'}).find_all('span', recursive=False)
print(current_population)
This is the object the information is stored in:
<span class="rts-counter" rel="current_population">retrieving data... </span>
and in 'inspect-mode' you can see this:
<span class="rts-counter" rel="current_population"><span class="rts-nr-sign"></span><span class="rts-nr-int rts-nr-10e9">7</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e6">703</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e3">227</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e0">630</span></span>
I always only get the first one, but want to get the second one from 'inspect-mode'.
Here is a picture of the inspect-mode.
You are going to need a method that lets JavaScript run, such as selenium, as this number is set up via a counter that is generated by this script: https://www.realtimestatistics.net/rts/RTSp.js
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
print(d.find_element_by_css_selector('[rel="current_population"]').text)
You could try writing your own version of that javascript script but I wouldn't recommend it.
I didn't need an explicit wait condition for the selenium script, but one could be added.
The website you are scraping is a JavaScript web app. The element content you see in inspect mode is the result of running some JavaScript code after the page downloads that populates that element. Prior to the JavaScript running, the element only contains the text "retrieving data...", which is what you see in your Python code. Neither the Python requests library nor BeautifulSoup run JavaScript in downloaded HTML -- they only download and parse the HTML, and that is why your code only sees the initial text.
You have two options:
Inspect the JavaScript code or website calls and figure out what HTTP URL the page is calling to retrieve the value it puts into that element. Have your Python code fetch that URL instead and parse the value from the response for that URL.
Use a full browser engine. This StackOverflow answer provides a solution: Web-scraping JavaScript page with Python
JavaScript is rendered on the DOM, so Beautiful Soup will not work as you want it to.
You will have to use something that lets JavaScript run (e.g. a browser), so you could build your own browser using Qt 4 or the like. Sentdex had a good tutorial on it here:
https://www.youtube.com/watch?v=FSH77vnOGqU
Otherwise, you could use Selenium:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get('https://www.worldometers.info/world-population/')
time.sleep(5)  # give the counter script time to run
html = driver.page_source
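From there the rendered HTML can be handed to BeautifulSoup; a small sketch, relying on the rel="current_population" attribute shown above:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
counter = soup.find('span', {'rel': 'current_population'})
# The digits are spread across child <span> elements, so join their text
print(counter.get_text(strip=True))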

Scraping text from a website when the text does not appear in the source

I am trying to retrieve the 'Now Playing' information from http://radioplayer.magic.co.uk/live using Python and Beautiful Soup.
I can see the text in a web browser and can copy and paste it, so I assume this text is downloaded from somewhere, but when I look at the page through Beautiful Soup I can't see the text or even where it might be coming from.
I am a beginner at this so please be gentle!
Thanks in advance for sharing your knowledge and experience.
ADDITIONAL INFORMATION: I am using Python 3 on a raspberry pi
The content of the Now Playing div is loaded dynamically by making an AJAX request, which is why it is not included in the page source you receive.
What you can do is imitate that AJAX request and fetch the response from there.
This is how you can achieve this:
import requests
import json

main_url = "http://radioplayer.magic.co.uk/live/"
ajax_url = "http://ps1.pubnub.com/subscribe/sub-eff4f180-d0c2-11e1-bee3-1b5222fb6268/np_4/0/14901814159272341?uuid=ef978c6c-2edf-4ff5-910a-39765d038427"

resp = requests.get(ajax_url).content
playing_list = json.loads(resp)[0]

# Keep the entry with the latest start_time -- that is the track playing now
max_time = 0
playing_now_dict = {}
for playing in playing_list:
    if int(playing['start_time']) > max_time:
        max_time = int(playing['start_time'])
        playing_now_dict = playing

print(playing_now_dict.get('title', ''))
print(playing_now_dict.get('artist', ''))
This currently prints:
Young Hearts Run Free
Candi Staton
It seems like a task for Python and selenium: http://selenium-python.readthedocs.io/ (this enables you to control the browser and do whatever you can do manually, e.g. select displayed text).
(Warning: the Firefox plugin is somewhat picky about the version; the last stable version in Ubuntu works only with Firefox up to 45.)
If you want to stick to using a headless browser (e.g. urllib, requests) then you will have to monitor the network calls while loading the website and get the exact URI (& necessary form data?) to use in python.
OR you could use python-selenium which will work exactly like the browser. Once you load the page, you can use driver.page_source to parse the source through BeautifulSoup.
Also, if you are lucky, maybe the website has an API (json/xml) that lets you fetch what you want without going through the hassle of parsing the raw source.
Selenium is usually more difficult to install than to actually use. For example, you could try the following out first on a normal PC:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
url = "http://radioplayer.magic.co.uk/live/"
browser = webdriver.Firefox(firefox_binary=FirefoxBinary())
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')
playlist = soup.find(id='playlist')
print(playlist.find('span', class_='artist').text)
print(playlist.find('span', class_='title').text)
This would give you something like:
Level 42
Running In The Family
You will need to investigate which browser driver will be compatible on a Raspberry Pi.

Why do Python and my web browser show different code for the same link?

Let's use the URL https://www.google.cl/#q=stackoverflow as an example. Using Chrome Developer Tools on the first link given by the search, we see this HTML code:
Now, if I run this code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = urlopen("https://www.google.cl/#q=stackoverflow")
soup = BeautifulSoup(url)
print(soup.prettify())
I won't find the same elements. In fact, I won't find any link from the results given by the Google search. The same goes if I use the requests module. Why does this happen? Can I do something to get the same results as if I were requesting from a web browser?
Since the html is generated dynamically, likely from a modern single page javascript framework like Angular or React (or even just plain JavaScript), you will need to actually drive a browser to the site using selenium or phantomjs before parsing the dom.
Here is some skeleton code.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("http://google.com")
html = driver.execute_script("return document.documentElement.innerHTML")
soup = BeautifulSoup(html, 'html.parser')
Here is the selenium documentation for more info on running selenium, configurations, etc.:
http://selenium-python.readthedocs.io/
Edit:
You will likely need to add a wait before grabbing the html, since it may take a second or so to load certain elements of the page. See below for a reference to the explicit wait documentation for Python selenium:
http://selenium-python.readthedocs.io/waits.html
Another source of complication is that certain parts of the page might be hidden until AFTER user interaction. In this case you will need to code your selenium script to interact with the page in certain ways before grabbing the html.
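For example, a minimal sketch combining an explicit wait with a click before grabbing the HTML might look like this (the element IDs are placeholders, not taken from any particular page):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://example.com")

# Wait up to 10 seconds for a (hypothetical) results container to appear
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "results")))

# Some content only appears after interaction, e.g. clicking a button
driver.find_element(By.ID, "show-more").click()

html = driver.execute_script("return document.documentElement.innerHTML")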

Python: scraping results of webpage of which results are generated server-side

I'm fairly new to Python, so excuse me if the problem isn't clear or if the answer is obvious.
I want to scrape the web page http://jassa.fr/. I generated some random input (sequences) to see how it holds up against my own data. I tried scraping the page using selenium, but the HTML of the webpage doesn't use any ids, and I don't know how to navigate through the DOM without using ids (is that impossible with selenium?).
Does anyone have any ideas for me how to tackle this problem, especially regarding that I want to scrape the results which are generated server side?
Thanks in advance!
[edit]
Thanks for the quick response!
How do I access this text area using selenium:
<textarea style="border:1px solid #999999;" tabindex="1" name="sequence" cols="70" rows="4" onfocus="if(this.value=='Type or paste your sequence')this.value='';">Type or paste your sequence</textarea>
Edit: After the clarification that you need to access the <textarea> with the name sequence, I suggest using find_element_by_name; see here for more details on selecting elements in Selenium.
from selenium import webdriver
url = "http://jassa.fr/"
browser = webdriver.Firefox()
browser.get(url)
form = browser.find_element_by_tag_name("form")
sequence = form.find_element_by_name("sequence")
sequence.clear()
sequence.send_keys("ATTTAATTTA")
form.submit()
Selenium has ability to navigate the tree and select elements not only by ID but also by class, tag name, link text and so on (see the docs), but I found myself more comfortable with the following scenario: I use Selenium to grab the webpage content (so the browser renders page with javascript things) and then feed BeautifulSoup with it and navigate it with BeautifulSoup methods. It looks like this:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "http://example.com/"
browser = webdriver.Firefox()
browser.get(url)
page = BeautifulSoup(browser.page_source, "lxml")
# Let's find some tables and then print all their rows
for table in page("table"):
    for row in table("tr"):
        print(row)
However, I'm not sure that you really need Selenium. The site you are going to parse doesn't seem to rely on JavaScript heavily, so it may be easier just to use simpler solutions like RoboBrowser or MechanicalSoup (or mechanize for Python 2).
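A minimal MechanicalSoup sketch for this site might look like the following; the field name sequence comes from the question, but the rest is untested against jassa.fr:
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("http://jassa.fr/")

# Select the first form on the page and fill the textarea by its name
browser.select_form("form")
browser["sequence"] = "ATTTAATTTA"
browser.submit_selected()

# The server-rendered result page is now available as BeautifulSoup
print(browser.page.get_text()[:500])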

Html code in inspect element differs from html source code

I am trying to crawl a website (with Python) and get its user info. But when I download the source of the pages, it is different from what I see in inspect element in Chrome. I googled and it seems I should use selenium, but I don't know how to use it. This is the code I have, and when I look at driver.page_source it is still the same as Chrome's "view source" and doesn't look like the markup in inspect element.
I really appreciate if someone can help me to fix this.
import os
from selenium import webdriver
chromedriver = "/Users/adam/Downloads/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
driver.get("http://www.tudiabetes.org/forum/users/Bug74/activity")
driver.quit()
It's called XHR.
Your page was loaded from another call (your URL only loads the structure of the page; the meat of the page comes from a different source via XHR, as a JSON-formatted string), not from the page load itself.
You should really consider using requests and bs4 to query this page instead.
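A minimal sketch of that approach; the endpoint URL below is a placeholder, so copy the real XHR request you see in the devtools network tab while the page loads:
import requests

# Placeholder URL -- replace with the JSON request shown in the network tab
xhr_url = "https://example.com/path/to/the/json/endpoint"

response = requests.get(xhr_url, headers={"Accept": "application/json"})
data = response.json()
print(data)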
