Router Access - Beautiful Soup - Python 3.5

I have a router that I want to log in to and retrieve information from using a Python script. I'm a newbie to Python but want to learn and explore more with it. Here is what I have written so far:
from requests.auth import HTTPBasicAuth
import requests
from bs4 import BeautifulSoup
response = requests.get('http://192.168.1.1/Settings.html/', auth=HTTPBasicAuth('Username', 'Password'))
html = response.content
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
I have two questions:
When I run the script the first time, I receive an authentication error. On running the script a second time it seems to authenticate fine and retrieve the HTML. Is there a better method?
With BS I want to retrieve only the code I require. I can't see a tag to point BS at. At the start of the HTML there is a list of variables whose data I want to scrape, for example:
var Device Pin = '12345678';
It's much easier to retrieve the information with a single script than by logging on to the web interface each time. The variable sits within a script tag with type="text/javascript".
Is BS the correct tool for the job? Can I just scrape that one line from the list of variables?
Any help, as always, is very much appreciated.

As far as I know, BeautifulSoup does not handle JavaScript. In this case, it's simple enough to just use a regular expression:
import re
# Match against response.text (a str); response.content is bytes and a str pattern won't match it.
m = re.search(r"var Device Pin\s*=\s*'(\d+)'", response.text)
pin = m.group(1) if m else None
Regarding the authentication problem, you can wrap the call in a try/except block (or a small retry loop) and repeat it if it doesn't work the first time, as sketched below.
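For example, a minimal retry sketch (this assumes a failed Basic Auth attempt comes back as a 401 status rather than raising an exception; the URL and credentials are the values from the question, and the retry count is arbitrary):
import requests
from requests.auth import HTTPBasicAuth
response = None
for _ in range(3):  # arbitrary retry limit
    try:
        response = requests.get('http://192.168.1.1/Settings.html/',
                                auth=HTTPBasicAuth('Username', 'Password'),
                                timeout=10)
        if response.status_code == 200:
            break  # authenticated and fetched successfully
    except requests.RequestException:
        pass  # transient network error; try again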

I'd run a packet sniffer, tcpdump or Wireshark, to see the interaction between your script and your router. Viewing the exchanges may help determine why you're unable to authenticate on the first pass. As a workaround, run the auth section in a for loop that tries N times to authenticate before failing.
Regarding scraping, you may want to consider lxml with the BeautifulSoup parser so you can use XPath. See "Can we use XPath with BeautifulSoup?"
XPath would let you easily pull a single value, text, attribute, etc. from the HTML if lxml can parse it.
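A minimal sketch of that combination, reusing the response from the question's code (the XPath is only a placeholder; adjust it to the router page's actual structure):
from lxml.html import soupparser
# Parse the fetched HTML with the BeautifulSoup-backed parser, then query it with XPath.
root = soupparser.fromstring(response.text)
scripts = root.xpath('//script[@type="text/javascript"]/text()')  # text of each inline script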

Related

How can I fix the find_all error while web scraping?

I am kind of a newbie in the data world, so I tried to use bs4 and requests to scrape data from trending YouTube videos. I tried using the soup.find_all() method and, to see if it works, I displayed the result, but it gives me an empty list. Can you help me fix it? Click here to see the specific part of the HTML code.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content,"lxml")
soup.prettify()
trendings = soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)
This webpage is dynamic and uses scripts to load its data. When you make a request with requests.get("https://www.youtube.com/feed/explore"), you only get the initial source file, which contains little more than the head, meta tags, and scripts. In a real browser you would have to wait until those scripts fetch the data from the server. BeautifulSoup does not see interactions with the DOM made via JavaScript. That's why soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: there is no ytd-video-renderer tag or style-scope ytd-expanded-shelf-contents-renderer class in the downloaded HTML.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy).
For YouTube, you can use its API as well.
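If you go the Selenium route, here is a rough sketch (Selenium 4 API; it assumes a Chrome driver is installed, and YouTube's tag names change often, so treat the selector as a guess):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome()
driver.get("https://www.youtube.com/feed/explore")
time.sleep(5)  # crude wait for the scripts to render the video tiles; an explicit wait is better
videos = driver.find_elements(By.TAG_NAME, "ytd-video-renderer")
print(len(videos))
driver.quit()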

URL request does not return all the information in the HTML using Python

I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website is not directly connecting to the exchange main page, but I am not sure it is necessary.
The thing is that I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I reach the full HTML code and then start extracting the price information from this website?
I know Python, but I am not that familiar with websites/HTML. So I would appreciate it if you explain the website-related info like you are talking to a beginner. Thanks!
There could be a few reasons for this.
The website runs behind a proxy server from what I can tell, so this does interfere with your request loading time. This is why it's not directly connecting to the main page.
It might also be the case that the elements are rendered using JavaScript AFTER the page has loaded, so you only get the base page and not the JavaScript-rendered parts. You can try increasing the sleep() time, but I don't think that will help.
You can also use a library called Selenium. It automates browsers, and you can use its page_source property to obtain the rendered HTML source code.
Code (taken from here)
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://example.com")
html_source = browser.page_source
With Selenium, you can also use XPath to obtain the data you want (the price information from this website); you can see a tutorial on that here. Alternatively,
once you have the rendered HTML, you can also use a parser such as bs4 to extract the required data, as sketched below.
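A minimal sketch of that combination (the CSS selector is purely hypothetical; inspect the page to find the element that actually holds the price):
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get("https://chiliz.net")
soup = BeautifulSoup(browser.page_source, "html.parser")
price = soup.select_one(".price")  # hypothetical selector; replace with the real one
print(price.text if price else "price element not found")
browser.quit()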

How to print a div data-reactid?

I'm doing a project in my spare time where I have hit a problem with getting data from a webpage into the program.
This is my current code:
import urllib
import re
htmlfile = urllib.urlopen("http://www.superliga.dk/klub/aab?sub=squad")
htmltext = htmlfile.read()
regex = r'<div data-reactid=".3.$squad content.0.$=11:0.0.0.0.1:0.2.0.0">([^<]*)</div>'
pattern = re.compile(regex)
goal = re.findall(pattern,htmltext)
print goal
And it's working okay except this part:
regex = r'<div data-reactid=".3.$squad content.0.$=11:0.0.0.0.1:0.2.0.0">([^<]*)</div>'
I can't make it display all values on the webpage with this reactid, and I can't find any solution to this problem.
Any suggestions to how I can get Python to print it?
You are trying to match a tag you saw in the developer console of your browser, right?
Unfortunately, the HTML you saw is only the "final form" of a dynamic page: what you downloaded with urlopen is only the skeleton of the webpage, which the browser then fills dynamically with other elements via JavaScript, using data fetched from some backend server.
If you print the actual value stored in htmltext, you will find nothing like what you are trying to match with the regex, because it is missing all the further processing normally performed by the JavaScript.
What you can try to do is monitor (through the dev console) the fetched resources and reverse-engineer the API calls in order to recover the desired info. Chances are the responses of these API calls are in JSON format, or have a structure far easier to parse than the HTML body.
UPDATE: for example, in Chrome's dev tools I can see async calls like:
http://ss2.tjekscores.dk/pro-stats/tournaments/46/top-players?sortBy=eventsStats.goals&limit=5&skip=0&positionId=&q=&seasonId=10392&teamId[]=8470
Maybe this returns the info you are looking for.
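If it does, a minimal sketch of querying it directly (the endpoint and parameters were captured at the time and may well have changed since):
import requests
url = ("http://ss2.tjekscores.dk/pro-stats/tournaments/46/top-players"
       "?sortBy=eventsStats.goals&limit=5&skip=0&positionId=&q="
       "&seasonId=10392&teamId[]=8470")
resp = requests.get(url, timeout=10)
data = resp.json()  # if the endpoint returns JSON, this is far easier to work with than the HTML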

Crawling web data using Python - HTML error

I want to crawl data using Python.
I tried again and again, but it didn't work,
and I cannot find the error in my code.
I wrote the code like this:
import re
import requests
from bs4 import BeautifulSoup
url='http://news.naver.com/main/ranking/read.nhn?mid=etc&sid1=111&rankingType=popular_week&oid=277&aid=0003773756&date=20160622&type=1&rankingSectionId=102&rankingSeq=1'
html=requests.get(url)
#print(html.text)
a=html.text
bs=BeautifulSoup(a,'html.parser')
print(bs)
print(bs.find('span',attrs={"class" : "u_cbox_contents"}))
I want to crawl the reply (comment) data in the news article.
As you can see, I tried searching for this:
span, class="u_cbox_contents" in bs
but Python only says "None":
None
So I checked bs using print(bs)
and looked through the variable's contents,
but there is no span with class="u_cbox_contents".
Why is this happening?
I really don't know why.
Please help me.
Thanks for reading.
Requests will fetch the URL's contents, but will not execute any JavaScript.
I performed the same fetch with cURL, and I can't find any occurrence of u_cbox_contents in the HTML code. Most likely, it's injected using JavaScript, which explains why BeautifulSoup can't find it.
If you need the page's code as it would be rendered in a "normal" browser, you could try Selenium (sketched below). Also have a look at this SO question.
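A minimal sketch with an explicit wait (this assumes a local Chrome driver is installed; the class name is taken from the question and may differ on the live page):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'http://news.naver.com/main/ranking/read.nhn?mid=etc&sid1=111&rankingType=popular_week&oid=277&aid=0003773756&date=20160622&type=1&rankingSectionId=102&rankingSeq=1'
driver = webdriver.Chrome()
driver.get(url)
# Wait up to 10 seconds for the comment span to be injected by JavaScript.
comment = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "span.u_cbox_contents"))
)
print(comment.text)
driver.quit()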

Scraping the comments section of a Web page

I am attempting to scrape the comments counter from a Web page. The code is presented below.
When I ask it to print letters, the output is an empty list. Why is that happening?
import urllib2
from bs4 import BeautifulSoup
r2 = urllib2.urlopen("http://www.ign.com/articles/2016/01/03/steam-surpasses-12-million-concurrent-users").read()
soup2 = BeautifulSoup(r2)
letters = soup2.find_all("div",class_="fyre-comment-count")
print letters
The list is empty because there are no comments on that page. div#livefyre-comment is empty, and div.fyre-comment-count does not exist.
Up in the page's header, there is a suspicious script tag pulling JavaScript from http://cdn.livefyre.com/Livefyre.js. I don't know what Livefyre is, but I assume it sucks comments out of a database somewhere and inserts them into div#livefyre-comment or its surrounding div.article-comments. Presumably, div.fyre-comment-count will also appear somewhere in the DOM once the script is done.
This sort of... design decision is increasingly common on Web sites. To see what a Web page really looks like, browse it with both JavaScript and cookies off (and be prepared for the occasional "500 Internal Server Error" from sites that never imagined such hooliganism was possible).
I don't know enough about screen scraping to tell you where to go from here. You might be able to piece together a URL to fetch the comments (and their count) directly from Livefyre. I'd start by perusing the JavaScript functions they provide, and the data-settings attribute of div#livefyre-comment, which appears to be a JSON dictionary full of relevant parameters (see the sketch below).
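As a starting point, a sketch of pulling that data-settings attribute (this assumes the attribute exists and holds valid JSON, which is only a guess from the page source; it uses urllib2 like the rest of this thread, i.e. Python 2):
import json
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen("http://www.ign.com/articles/2016/01/03/steam-surpasses-12-million-concurrent-users").read()
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="livefyre-comment")
if div and div.has_attr("data-settings"):
    settings = json.loads(div["data-settings"])  # JSON dictionary of Livefyre parameters
    print (settings.keys())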
Your code is very close, almost right. You just missed a few things. Check the code below.
from bs4 import BeautifulSoup
import urllib2
r2 = urllib2.urlopen("http://www.ign.com/articles/2016/01/03/steam-surpasses-12-million-concurrent-users").read()
soup = BeautifulSoup(r2, 'html.parser')
for line in soup.find_all("div", class_="fyre-comment-count"):
    comments = ''.join(line.find_all(text=True))
    print (comments)
