crawling web data using python html error

crawling web data using python html error - python

i want to crawling data using python
i tried tried again
but it didn't work
i can not found code's error
i wrote code like this:
import re
import requests
from bs4 import BeautifulSoup
url='http://news.naver.com/main/ranking/read.nhn?mid=etc&sid1=111&rankingType=popular_week&oid=277&aid=0003773756&date=20160622&type=1&rankingSectionId=102&rankingSeq=1'
html=requests.get(url)
#print(html.text)
a=html.text
bs=BeautifulSoup(a,'html.parser')
print(bs)
print(bs.find('span',attrs={"class" : "u_cbox_contents"}))
i want to crawl reply data in news
as you can see, i tried to searing this:
span, class="u_cbox_contents" in bs
but python only say "None"
None
so i check bs using function print(bs)
and i check up bs variable's contents
but there is no span, class="u_cbox_contents"
why this happing?
i really don't know why
please help me
thanks for reading.

Requests will fetch the URL's contents, but will not execute any JavaScript.
I performed the same fetch with cURL, and I can't find any occurrence of u_cbox_contents in the HTML code. Most likely, it's injected using JavaScript, which explains why BeautifulSoup can't find it.
If you need the page's code as it would be rendered in a "normal" browser, you could try Selenium. Also have a look at this SO question.

Related

How can i fix the find_all error while web scraping?

I am kinda a newbie in data world. So i tried to use bs4 and requests to scrap data from trending youtube videos. I have tried using soup.findall() method. To see if it works i displayed it. But it gives me an empty list. Can you help me fix it? Click here to see the spesific part of the html code.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content,"lxml")
soup.prettify()
trendings = soup.find_all("ytd-video-renderer",attrs = {"class":"style-scope ytd-expanded-
shelf-contents-renderer"})
print(trending)

This webpage is dynamic and contains scripts to load data. Whenever you make a request using requests.get("https://www.youtube.com/feed/explore"), it loads the initial source code file that only contains information like head, meta, etc, and scripts. In a real-world scenario, you will have to wait until scripts load data from the server. BeautifulSoup does not catch the interactions with DOM via JavaScript. That's why soup.find_all("ytd-video-renderer",attrs = {"class":"style-scope ytd-expanded-shelf-contents-renderer"}) gives you empty list as there is no ytd-video-renderer tag or style-scope ytd-expanded-shelf-contents-renderer class.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy).
For Youtube, you can use it's API as well.

Is there a way to get information about elements from the inspect menu in a website?

I've tried to get the world population from this website: https://www.worldometers.info/world-population/
but I can only get the html code, not the data of the actual numbers.
I already tried to find children of the object I tried to get data from. I also tried to list the whole object, but nothing seemed to work.
'''just importing stuff '''
import urllib.request
import requests
from bs4 import BeautifulSoup
'''getting html from website to text '''
r = requests.get('https://www.worldometers.info/world-population/')
soup = BeautifulSoup(r.text,'html.parser')
'''here it only finds the one object that's is listed below '''
current_population = soup.find('div',{'class':'maincounter-number'}).find_all('span', recursive=False)
print(current_population)
This is the object the information is stored in:
(span class="rts-counter" rel="current_population">retrieving data... </span>
and in 'inspect-mode' you can see this:
(span class="rts-counter" rel="current_population">(span class="rts-nr-sign"></span>(span class="rts-nr-int rts-nr-10e9">7</span>(span class="rts-nr-thsep">,</span>(span class="rts-nr-int rts-nr-10e6">703</span>(span class="rts-nr-thsep">,</span>(span class="rts-nr-int rts-nr-10e3">227</span><span class="rts-nr-thsep">,</span>(span class="rts-nr-int rts-nr-10e0">630</span></span>
I always only get the first one, but want to get the second one from 'inspect-mode'.
Here is a picture of the inspect-mode.

You are going to need a method that lets javascript run such as selenium as this number is set up via a counter that is generated in this script: https://www.realtimestatistics.net/rts/RTSp.js
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
print(d.find_element_by_css_selector('[rel="current_population"]').text)
You could try writing your own version of that javascript script but I wouldn't recommend it.
I didn't need an explicit wait condition for selenium script but that could be added.

The website you are scraping is a JavaScript web app. The element content you see in inspect mode is the result of running some JavaScript code after the page downloads that populates that element. Prior to the JavaScript running, the element only contains the text "retrieving data...", which is what you see in your Python code. Neither the Python requests library nor BeautifulSoup run JavaScript in downloaded HTML -- they only download and parse the HTML, and that is why your code only sees the initial text.
You have two options:
Inspect the JavaScript code or website calls and figure out what HTTP URL the page is calling to retrieve the value it puts into that element. Have your Python code fetch that URL instead and parse the value from the response for that URL.
Use a full browser engine. This StackOverflow answer provides a solution: Web-scraping JavaScript page with Python

Javascript is rendered on the DOM so Beautiful Soup will not work as you want it to.
You will have to make something that lets javascript run(eg: browser) so you can make your own browser using QT4 or the like. Sentdex had a good tutorial on it here:
https://www.youtube.com/watch?v=FSH77vnOGqU
Otherwise, you could use Selenium:
from selenium import webdriver
import time
drive = webdriver.Firefox()
drive.get('https://www.worldometers.info/world-population/')
time.sleep(5)
html = driver.page_source

Router Access - Beautiful Soup - Python 3.5

I have a router that I want to login to and retrieve information using Python script. Im a newbie to Python but want to learn and explore more with it. Here is what I have written so far:
from requests.auth import HTTPBasicAuth
import requests
from bs4 import BeautifulSoup
response = requests.get('http://192.168.1.1/Settings.html/', auth=HTTPBasicAuth('Username', 'Password'))
html = response.content
soup = BeautifulSoup(html, "html.parser")
print (soup.prettify())
I have two questions which are:
When I run the script the first time, I receive an authentication error. On running the script a second time it seems to authenticate fine and retrieve the HTML. Is there a better method?
With BS I want to only retrieve the code I require from the script. I cant see a tag to set BS to scrape. At the start of the HTML there are a list of variables of which I want to scrape the data for example:
var Device Pin = '12345678';
Its much easier to retrieve the information using a single script instead of logging onto the web interface each time. It sits within the script type="text/javascript".
Is BS the correct tool for the job. Can I just scrape the one line in the list of variables?
Any help as always very much appreciatted.

As far as I know, BeautifulSoup does not handle javascript. In this case, it's simple enough to just use regular expressions
import re
m = re.search(r"var Device Pin\s+= '(\d+)'", html)
pin = m.group(1)
Regarding the authentication problem, you can wrap your call in try except to redo the call if it doesn't work the first time.

I'd run a packet sniffer, tcpdump or wireshark, to see the interaction between your script and your router. Viewing the interactions may help determine why you're unable to authenticate on the first pass. As a workaround, run the auth section in a for loop which will try N number of times to authenticate before failing.
Regarding scraping, you may want to consider lxml with the beautiful soup parser so you can use XPath. See can we use xpath with BeautifulSoup?
XPath would allow you easily pull a single value, text, attribute, etc. from the html if lxml can parse it.

Wait until page is fully loaded and then reading its content with urllib2/3 [duplicate]

I'm learning to make web scrapers and want to scrape TripAdvisor for a personal project, grabbing the html using urllib2.
However, I'm running into a problem where, using the code below, the html I get back is not correct as the page seems to take a second to redirect (you can verify this by visiting the url) - instead I get the code from the page that initially briefly appears.
Is there some behavior or parameter to set to make sure the page has completely finished loading/redirecting before getting the website content?
import urllib2
from bs4 import BeautifulSoup
bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
print soup.prettify()
Edit: The answer is thorough, however, in the end what solved my problem was this:
https://stackoverflow.com/a/3210737/1157283

Inreresting the problem isn't a redirect is that page modifies the content using javascript, but urllib2 doesn't have a JS engine it just GETS data, if you disabled javascript on your browser you will note it loads basically the same content as what urllib2 returns
import urllib2
from BeautifulSoup import BeautifulSoup
bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
open('test.html', 'w').write(soup.read())
test.html and disabling JS in your browser, easiest in firefox content -> uncheck enable javascript, generates identical result sets.
So what can we do well, first we should check if the site offers an API, scrapping tends to be frown up
http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available
Travel/Hotel API's?
it looks they might, though with some restrictions.
But if we still need to scrape it, with JS, then we can use selenium http://seleniumhq.org/ its mainly used for testing, but its easy and has fairly good docs.
I also found this Scraping websites with Javascript enabled? and this http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
hope that helps.
As a side note:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>>
>>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
>>> value = bostonPage.read()
>>> soup = BeautifulSoup(value)
>>> open('test.html', 'w').write(value)

Python Scraping fb comments from a website

I have been trying to scrape facebook comments using Beautiful Soup on the below website pages.
import BeautifulSoup
import urllib2
import re
url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
fd = urllib2.urlopen(url)
soup = BeautifulSoup.BeautifulSoup(fd)
fb_comment = soup("div", {"class":"postText"}).find(text=True)
print fb_comment
The output is a null set. However, I can clearly see the facebook comment is within those above tags in the inspect element of the techcrunch site (I am little new to Python and was wondering if the approach is correct and where I am going wrong?)

Like Christopher and Thiefmaster: it is all because of javascript.
But, if you really need that information, you can still retrieve it thanks to Selenium on http://seleniumhq.org then use beautifulsoup on this output.

Facebook comments are loaded dynamically using AJAX. You can scrape the original page to retrieve this:
<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" num_posts="25" width="630"></fb:comments>
After that you need to send a request to some Facebook API that will give you the comments for the URL in that tag.

The parts of the page you are looking for are not included in the source file. Use a browser and you can see this for yourself by opening the page source.
You will need to use something like pywebkitgtk to have the javascript executed before passing the document to BeautifulSoup

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

crawling web data using python html error - python

Related

How can i fix the find_all error while web scraping?

Is there a way to get information about elements from the inspect menu in a website?

Router Access - Beautiful Soup - Python 3.5

Wait until page is fully loaded and then reading its content with urllib2/3 [duplicate]

Python Scraping fb comments from a website

Categories

Resources