Python web scraping - realtime data

I am trying to scrape the live data at the top of this page:
https://www.wallstreet-online.de/devisen/euro-us-dollar-eur-usd-kurs/realtime
My current method:
import time
import requests
from bs4 import BeautifulSoup as soup

while True:
    con = requests.get('https://www.wallstreet-online.de/devisen/euro-us-dollar-eur-usd-kurs/realtime', stream=True)
    page = con.text
    kursSoup = soup(page, "html.parser")
    kursDiv = kursSoup.find("div", {"class": "pull-left quoteValue"})
    print(kursDiv.span)
    del con, page, kursSoup, kursDiv
    #time.sleep(2)
print("end")
This works, but is not in sync with the data on the website. I don't really get why, because I delete all the variables at the end of the loop, so the result should change when the data on the website changes, but it seems to stay the same for a fixed number of iterations. Does anyone know why, or does anyone have a better way of doing this? (I'm a complete beginner and have no idea how the site even works; that's why I'm parsing the HTML.)

It looks like that web page may be using JavaScript to populate and update that number. I'm not familiar with BeautifulSoup but I don't think it will run the JavaScript on the page to update that number.
You may want to use something like Chrome Developer Tools to keep an eye on the Network tab. When I looked, there appears to be a websocket connection to wss://push.wallstreet-online.de/lightstreamer going on behind the scenes. You may want to use a websocket client library for Python to read from this socket, and either find some API docs or reverse engineer the data that comes over it. Good luck!

Related

How can I fix the find_all error while web scraping?

I am kind of a newbie in the data world. I tried to use bs4 and requests to scrape data from trending YouTube videos, using the soup.find_all() method. To see if it works I displayed the result, but it gives me an empty list. Can you help me fix it?
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content, "lxml")
trendings = soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)
This webpage is dynamic and uses scripts to load its data. Whenever you make a request with requests.get("https://www.youtube.com/feed/explore"), you get the initial source code file, which only contains things like head, meta, and script tags. In the browser, those scripts then load the actual data from the server. BeautifulSoup does not see DOM changes made by JavaScript. That's why soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: in the raw source there is no ytd-video-renderer tag or style-scope ytd-expanded-shelf-contents-renderer class.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy).
For YouTube, you can use its API as well.
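One browser-free alternative worth knowing about: pages like this often embed their initial data as a JSON blob inside a script tag in the raw source (on YouTube it is commonly assigned to a variable named ytInitialData). A rough sketch of pulling such a blob out with only the standard library, run here against a tiny made-up sample page rather than a live request (the structure of the real blob will differ):

```python
import json
import re

# Made-up sample standing in for a page's raw source; real pages embed a much
# larger object, commonly assigned to a variable such as ytInitialData.
page_source = '''
<script>var ytInitialData = {"videos": [{"title": "Trending clip"}]};</script>
'''

# Capture everything between "ytInitialData = " and the closing "</script>",
# then hand the captured JSON text to the json module.
match = re.search(r"ytInitialData\s*=\s*(\{.*?\});?</script>", page_source, re.DOTALL)
data = json.loads(match.group(1))
print(data["videos"][0]["title"])
```

This avoids a browser entirely, but it is brittle: the variable name and layout can change at any time, which is why the API or Selenium are the more robust suggestions above.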

Trying to read data from War Thunder localhost with Python

Basically I'm using Python to send serial data to an Arduino so that I can make moving dials using data from the game. This should work because you can use the URL "localhost:8111" to get a list of these stats when in game. The problem is I'm using urllib and BeautifulSoup, but they seem to be blindly reading the source code, not giving me the data I need.
The data I need shows up when I inspect the element on that page. Other pages seem to suggest that running the page's scripts in Python would fix this, but I have found no way of doing that. Any help here would be great, thanks.
Not the poster, but I have been working on this with him. We managed to get it working. In case anyone else is having this problem, here is the code that got it to display our speed:
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("http://localhost:8111")
time.sleep(1)
while True:
    elements = driver.find_element_by_id("stt-IAS, km/h")
    print(elements.text)
Don't know why the time.sleep is needed but the code doesn't seem to work without it.
Your problem might be that the page elements are dynamic (revealed by JavaScript, for example).
Why is this a problem? Because you can't access those tags or that data with a plain HTTP request. You'll have to use a headless/automated browser (learn more about Selenium).
Then make a session through Selenium and keep feeding the data to the Arduino the way you wanted.
Summary: if you inspect elements you can see the tag, but if you go to view source you can't see it. This can't be solved using bs4 or requests alone. You'll have to use a module like Selenium or something similar.
Here is a Python module that you can use to get all air vehicle telemetry data from War Thunder localhost server pages "indicators" and "status". The contents of each of these pages are static JSON descriptions of the vehicle's current telemetry values.
The Python package uses the requests module to query the localhost server for the data, converts the returned JSON data into dictionaries, and then consolidates all the data into a singular telemetry dictionary. This data can then be used for other Python processes such as datalogging or graphing.
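That consolidation step can be sketched in a few lines, assuming each page serves a flat JSON object. The endpoint paths in the commented-out usage and the sample keys below are illustrative, not confirmed; check what your own localhost:8111 actually serves:

```python
import json
import urllib.request

def fetch_json(url):
    """Download a page and parse its body as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode())

def consolidate(*pages):
    """Merge several telemetry dicts into one; later pages win on key clashes."""
    telemetry = {}
    for page in pages:
        telemetry.update(page)
    return telemetry

# Live usage would look something like this (requires the game running):
# indicators = fetch_json("http://localhost:8111/indicators")
# status = fetch_json("http://localhost:8111/status")
# telemetry = consolidate(indicators, status)

# Demonstration with made-up sample dicts:
print(consolidate({"speed": 320}, {"altitude": 1500}))
```

The consolidated dict can then be fed to datalogging, graphing, or the serial link to the Arduino.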

Python search and scrape results

This is my first post, so I apologize if it is a duplicate, but I could not find an answer relevant to mine. If there is one, please let me know and I will check it out.
I am attempting to scrape a website (below) to find Berkeley rent ceilings. The trouble I'm having is that I cannot figure out how to insert an address into the search box and scrape the info from the next page. In the past, the URLs I've worked with changed with the search input, but not on this website. I thought my best bet would be using bs4 to scrape the info, and requests.session and requests.post to get to each subsequent address.
#Berkeley Rent Scrape
from bs4 import BeautifulSoup
import sys
import requests
import openpyxl
import pprint
import csv

#wb = openpyxl.load_workbook('workbook.xlsx', data_only=True)
#sheet = wb.get_sheet_by_name('worksheet')

props_payload = {'aspnetForm': '1150 Oxford St'}
URL = 'http://www.ci.berkeley.ca.us/RentBoardUnitSearch.aspx'
s = requests.session()
p = s.post(URL, data=props_payload)
soup = BeautifulSoup(p.text, "html.parser")
data = soup.find_all('td', class_="gridItem")
UPDATE: How do you get the info from the new webpage once the POST has been sent? In other words, what is the framework for using a requests.post and then a requests.get or bs4 scrape when the URL does not change?
I was thinking it would look something like the above, but I'm sure I need a GET request somewhere in there, and I don't understand how sessions work when the URL doesn't change.
I will be exporting the info to a CSV file and to an Excel sheet, but I can deal with that later. Just want to get the meat out of the way.
Thank you for any help!
As you can see in the link, this search does not work through redirection, so you can't pass your query in the URL.
I'm not sure how you can work directly with the ASP.NET backend (it might be tricky due to authentication/validation on the backend).
I think some automation (test) tool can help you (e.g. PhantomJS and/or CasperJS). It gives you control over the rendered web page, and you can programmatically put the query into the input and grab the data from the response.
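If you do want to try the direct-POST route: ASP.NET form pages typically require you to echo back hidden fields (commonly __VIEWSTATE and __EVENTVALIDATION) collected from an initial GET of the form, alongside your own input values. A standard-library sketch of collecting those hidden fields, run against a made-up form snippet (the field names and values here are typical ASP.NET conventions, not taken from the Berkeley page):

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> fields in a form."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        a = dict(attrs)
        if a.get("type") == "hidden" and "name" in a:
            self.fields[a["name"]] = a.get("value", "")

# Made-up sample form for illustration
sample = '''
<form id="aspnetForm">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="address" />
</form>
'''
collector = HiddenInputCollector()
collector.feed(sample)
print(collector.fields)
```

The idea would be: GET the search page inside a requests.session, feed the HTML to a collector like this, merge your address into the collected fields, then POST the combined payload back to the same URL, all in the same session so cookies carry over. Whether that satisfies this particular backend's validation is exactly the uncertainty mentioned above.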

extract data from website using python

I recently started learning Python, and one of the first projects I did was to scrape updates from my son's classroom web page and send me notifications when the site was updated. That turned out to be an easy project, so I wanted to expand on it and create a script that would automatically check if any of our lotto numbers hit. Unfortunately I haven't been able to figure out how to get the data from the website. Here is one of my attempts from last night.
from bs4 import BeautifulSoup
import urllib.request
webpage = "http://www.masslottery.com/games/lottery/large-winningnumbers.html"
websource = urllib.request.urlopen(webpage)
soup = BeautifulSoup(websource.read(), "html.parser")
span = soup.find("span", {"id": "winning_num_0"})
print (span)
Output is here...
<span id="winning_num_0"></span>
The output listed above is also what I see if I "View Source" in a web browser. When I "Inspect Element" in the web browser, I can see the winning numbers in the inspector panel. Unfortunately I'm not even sure how or where the web browser is getting the data. Is it loading from another page, or from a script in the background? I thought the following tutorial was going to help me, but I wasn't able to get the data using similar commands.
http://zevross.com/blog/2014/05/16/using-the-python-library-beautifulsoup-to-extract-data-from-a-webpage-applied-to-world-cup-rankings/
Any help is appreciated.
Thanks
If you look closely at the source of the page (I just used curl), you can see this block:
<script type="text/javascript">
// <![CDATA[
var dataPath = '../../';
var json_filename = 'data/json/games/lottery/recent.json';
var games = new Array();
var sessions = new Array();
// ]]>
</script>
That recent.json stuck out like a sore thumb (I actually missed the dataPath part at first).
After giving that a try, I came up with this:
curl http://www.masslottery.com/data/json/games/lottery/recent.json
Which, as lari points out in the comments, is way easier than scraping HTML. It's this easy, in fact:
import json
import urllib.request
from pprint import pprint
websource = urllib.request.urlopen('http://www.masslottery.com/data/json/games/lottery/recent.json')
data = json.loads(websource.read().decode())
pprint(data)
data is now a dict, and you can do whatever kind of dict-like things you'd like to do with it. And good luck ;)

Router Access - Beautiful Soup - Python 3.5

I have a router that I want to log in to and retrieve information from using a Python script. I'm a newbie to Python but want to learn and explore more with it. Here is what I have written so far:
from requests.auth import HTTPBasicAuth
import requests
from bs4 import BeautifulSoup
response = requests.get('http://192.168.1.1/Settings.html/', auth=HTTPBasicAuth('Username', 'Password'))
html = response.content
soup = BeautifulSoup(html, "html.parser")
print (soup.prettify())
I have two questions which are:
When I run the script the first time, I receive an authentication error. On running the script a second time it seems to authenticate fine and retrieve the HTML. Is there a better method?
With BS I want to retrieve only the part I require from the page. I can't see a tag to point BS at. At the start of the HTML there is a list of variables whose data I want to scrape, for example:
var Device Pin = '12345678';
It's much easier to retrieve the information with a single script than to log onto the web interface each time. It sits within a script of type="text/javascript".
Is BS the correct tool for the job? Can I just scrape that one line from the list of variables?
Any help, as always, very much appreciated.
As far as I know, BeautifulSoup does not handle JavaScript. In this case it's simple enough to just use regular expressions:
import re

# html here is response.content from your script, so decode the bytes first
m = re.search(r"var Device Pin\s+= '(\d+)'", html.decode())
pin = m.group(1)
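For reference, here is the same regex run against an inline, made-up sample of such a script block, so it can be tried without the router:

```python
import re

# Inline sample standing in for the router's HTML; the pin value is made up
html_sample = """<script type="text/javascript">
var Device Pin = '12345678';
</script>"""

# Capture the digits between the quotes after "var Device Pin ="
m = re.search(r"var Device Pin\s+= '(\d+)'", html_sample)
pin = m.group(1)
print(pin)
```

If the pin can ever contain non-digit characters, widen the capture group to something like `'([^']+)'`.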
Regarding the authentication problem, you can wrap your call in try/except and redo the call if it doesn't work the first time.
I'd run a packet sniffer, tcpdump or Wireshark, to see the interaction between your script and your router. Viewing the interactions may help determine why you're unable to authenticate on the first pass. As a workaround, run the auth section in a loop that tries N times before failing.
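That retry workaround can be sketched as a small generic helper (the names below are illustrative, not from the question):

```python
import time

def retry(fn, attempts=3, delay=1.0):
    """Call fn() up to `attempts` times, sleeping between failures;
    re-raise the last exception if every attempt fails."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc

# Usage with the question's requests call would look something like:
# response = retry(lambda: requests.get('http://192.168.1.1/Settings.html/',
#                                       auth=HTTPBasicAuth('Username', 'Password')))
```

Keeping the retry logic in one helper means the auth quirk is handled in a single place rather than duplicated around every request.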
Regarding scraping, you may want to consider lxml with the BeautifulSoup parser so you can use XPath. See: can we use xpath with BeautifulSoup?
XPath would let you easily pull a single value, text, attribute, etc. from the HTML, provided lxml can parse it.
