I am playing around and want to send myself an email when a new post appears on a forum thread, but when I open the URL with urllib.urlopen I get back the webpage without a page body. Can someone please tell me why this is the case, and how I can get the body?
import urllib
from bs4 import BeautifulSoup

def loadUrl(address):
    address = urllib.unquote(address)
    print("Loading " + address)
    page = urllib.urlopen(address)  # Python 2; in Python 3 this is urllib.request.urlopen
    html = page.read()
    page.close()
    soup = BeautifulSoup(html)
    return soup
soup = loadUrl("http://de.pokerstrategy.com/forum/thread.php?threadid=498111")
In addition, I would recommend using PyQuery.
from pyquery import PyQuery
d = PyQuery("http://de.pokerstrategy.com/forum/thread.php?threadid=498111")
print d("body").html()
EDIT: Sorry, I didn't realise that you had posted the URL you were trying to retrieve. I get the same response as you and am not sure why. I can't see anything in the javascript, as I had suggested below.
I tested your code and it seems to work fine. Perhaps the page you are trying to retrieve generates the body element via javascript or something similar. In this case I believe you can use something like selenium to emulate the browser.
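For completeness, here is a minimal selenium sketch for that case (assuming selenium and a matching Chrome driver are installed; the URL is the one from the question):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://de.pokerstrategy.com/forum/thread.php?threadid=498111")
# page_source holds the DOM after the browser has executed any javascript
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.body)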
I've had success using BeautifulSoup with urllib2, for example:
from urllib2 import urlopen
...
html = urlopen(...)
soup = BeautifulSoup(html)
Related
Hi, I want to get the text (the number 18) from the em tag, as shown in the picture above.
When I ran my code, it did not work and gave me only an empty list. Can anyone help me? Thank you~
Here is my code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://blog.naver.com/kwoohyun761/221945923725'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
When you disable javascript you'll see that the like count is loaded dynamically, so you have to use a service that renders the website, and then you can parse the content.
You can use an API: https://www.scraperapi.com/
Or run your own, for example Splash: https://github.com/scrapinghub/splash (see the sketch below).
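For illustration, a minimal sketch of the Splash route, assuming a Splash instance is running locally on the default port 8050 (e.g. started with docker run -p 8050:8050 scrapinghub/splash):
import requests
from bs4 import BeautifulSoup

# render.html returns the page HTML after Splash has executed the javascript
rendered = requests.get('http://localhost:8050/render.html',
                        params={'url': 'https://blog.naver.com/kwoohyun761/221945923725',
                                'wait': 2})
soup = BeautifulSoup(rendered.text, 'lxml')
print(soup.find_all('em', class_='u_cnt _count'))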
EDIT:
First of all, I missed that you were using urlopen incorrectly; the correct way is described here: https://docs.python.org/3/howto/urllib2.html (assuming you are using Python 3, which seems to be the case judging by the print call).
Furthermore, looking at the issue again, it is a bit more complicated: when you look at the source code of the page, it actually loads an iframe, and the actual content lives in that iframe. Hit Ctrl+U to see the source code of the original URL, since the site seems to block the browser context menu.
So in order to achieve your crawling objective you have to first grab the initial page and then grab the page you are interested in:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# original url
url = "https://blog.naver.com/kwoohyun761/221945923725"

with urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
iframe = soup.find('iframe')

# iframe grabbed, construct the real url
print(iframe['src'])
real_url = "https://blog.naver.com" + iframe['src']

# do your crawling
with urlopen(real_url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
You might be able to avoid one round trip by analyzing the original URL and the URL in the iframe; at first glance it looked like the iframe URL can be constructed from the original one.
You'll still need a rendered version of the iframe url to grab your desired value.
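If you do construct the iframe URL yourself, urllib.parse.urljoin is a safer way to combine it with the base than string concatenation, since it also handles absolute src values:
from urllib.parse import urljoin

real_url = urljoin("https://blog.naver.com", iframe['src'])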
I don't know what this site is about, but it seems they do not want to be crawled; maybe you should respect that.
I want to see the last URL of the website with Python. I'm mostly using requests and urllib2, but everything is welcome.
The website I'm testing doesn't return a 302 response; it redirects directly using HTML (or maybe PHP).
I used the requests module for this, but it seems it doesn't count HTML/PHP redirects as redirects.
My current code:
import requests

def get_real(domain):
    red_domain = requests.get(domain, allow_redirects=True).url
    return red_domain

print(get_real("some_url"))
If there is a way to achieve this, how? Thanks in advance!
Posts I checked:
Python follow redirects and then download the page?
Python Requests library redirect new url
Tracking redirection of the request using request history | Packtpub
EDIT: URL I'm trying: http://001.az. It's using HTML to redirect.
HTML Code inside it:
<HTML> <HEAD><META HTTP-EQUIV=Refresh CONTENT="0; url=http://fm.vc"></HEAD> </HTML>
BeautifulSoup can help in detecting HTML Meta redirections:
from bs4 import BeautifulSoup

# use requests to extract the HTML text
...

soup = BeautifulSoup(html_text.lower(), "html5lib")  # lower() because we only want redirections
try:
    content = soup.head.find('meta', {'http-equiv': 'refresh'}).attrs['content']
    ix = content.index('url=')
    url = content[ix + 4:]
    # ok, we have to redirect to url
except (AttributeError, KeyError, ValueError):
    url = None

# if url is not None, loop going there...
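Putting it together, a sketch of such a loop might look like this (get_meta_redirect is just a name for the detection logic above, and max_hops guards against redirect cycles):
import requests
from bs4 import BeautifulSoup

def get_meta_redirect(html_text):
    # same detection logic as above, wrapped in a function
    soup = BeautifulSoup(html_text.lower(), "html5lib")
    try:
        content = soup.head.find('meta', {'http-equiv': 'refresh'}).attrs['content']
        return content[content.index('url=') + 4:]
    except (AttributeError, KeyError, ValueError):
        return None

def get_real(domain, max_hops=10):
    url = domain
    for _ in range(max_hops):
        response = requests.get(url, allow_redirects=True)  # follows HTTP redirects
        target = get_meta_redirect(response.text)           # then check for a meta refresh
        if target is None:
            return response.url
        url = target
    raise RuntimeError('too many meta refresh redirects')

print(get_real("http://001.az"))  # should end up at http://fm.vc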
How can I extract the source code of this page using Python (https://mobile.twitter.com/i/bookmarks)?
The problem is that the actual page code does not appear.
import mechanicalsoup as ms

browser = ms.StatefulBrowser()
browser.open("https://mobile.twitter.com/login")
browser.select_form('form[action="/sessions"]')
browser["session[username_or_email]"] = 'email'
browser["session[password]"] = 'password'
browser.submit_selected()
browser.open("https://mobile.twitter.com/i/bookmarks")
html = browser.get_current_page()
print(html)
Use BeautifulSoup.
from urllib import request
from bs4 import BeautifulSoup
url_1 = "http://www.google.com"
page = request.urlopen(url_1)
soup = BeautifulSoup(page)
print(soup.prettify())
From this answer:
https://stackoverflow.com/a/43290890/11034096
Edit:
It looks like the issue is that Twitter is trying to use a JS redirect to load the next page. JS isn't supported by mechanicalsoup, so you'll need to try something like selenium.
The html variable that you are returning is actually a BeautifulSoup object, not the HTML text. I would try using:
print(html.text)
(text is a property, not a method) to see if that will print the content directly.
Alternatively, from the BeautifulSoup documentation you should be able to use the non-pretty printing of:
str(html)
or
unicode(html.a) (in Python 2)
I am currently trying to practice with the requests and BeautifulSoup modules in Python 3.6 and have run into an issue that I can't seem to find any info on in other questions and answers.
It seems that at some point in the page, Beautiful Soup stops recognizing tags and ids. I am trying to pull play-by-play data from a page like this:
http://www.pro-football-reference.com/boxscores/201609080den.htm
import requests, bs4

source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
    raise Exception('No data found for this link: ' + source_url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# this works
all_pbp = soup.findAll('div', {'id': 'all_pbp'})
print(len(all_pbp))

# this doesn't
table = soup.findAll('table', {'id': 'pbp'})
print(len(table))
Using the inspector in Chrome, I can see that the table definitely exists. I have also tried searching for divs and trs in the later half of the HTML, and that doesn't work either. I have tried the standard 'html.parser' as well as lxml and html5lib, but nothing seems to work.
Am I doing something wrong here, or is there something in the HTML or its formatting that prevents BeautifulSoup from correctly finding the later tags? I have run into issues with similar pages run by this company (hockey-reference.com, basketball-reference.com), but have been able to use these tools properly on other sites.
If it is something with the HTML, is there any better tool/library for helping to extract this info out there?
Thank you for your help,
BF
BS4 won't execute the javascript of a web page after doing the GET request for a URL. I think the table in question is loaded asynchronously by client-side javascript.
As a result, the client-side javascript needs to run before the HTML is scraped. This post describes how to do so!
OK, I found the problem: you're trying to parse a comment, not an ordinary HTML element.
For such cases you should use Comment from BeautifulSoup, like this:
import requests
from bs4 import BeautifulSoup, Comment

source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
    raise Exception('No data found for this link: ' + source_url)
soup = BeautifulSoup(res.content, 'html.parser')

comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment = BeautifulSoup(str(comment), 'html.parser')
    search_play = comment.find('table', {'id': 'pbp'})
    if search_play:
        play_to_play = search_play
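Once play_to_play holds the table, it behaves like any other parsed element; for example, to dump the rows:
for row in play_to_play.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    print(cells)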
I am trying to get the url and sneaker titles at https://stockx.com/sneakers.
This is my code so far:
In main.py:
from bs4 import BeautifulSoup
from utils import generate_request_header
import requests
url = "https://stockx.com/sneakers"
html = requests.get(url, headers=generate_request_header()).content
soup = BeautifulSoup(html, "lxml")
print soup
In utils.py:
def generate_request_header():
    # BASE_REQUEST_HEADER and USER_AGENT_HEADER_LIST are defined elsewhere in my project
    header = BASE_REQUEST_HEADER
    header["User-Agent"] = random.choice(USER_AGENT_HEADER_LIST)
    return header
But whenever I print soup, I get the following output: https://pastebin.com/Ua6B6241. There doesn't seem to be any HTML extracted. How would I get it? Should I be using something like Selenium?
requests doesn't seem to be able to verify the SSL certificates. To temporarily bypass this error, you can use verify=False, i.e.:
requests.get(url, headers=generate_request_header(), verify=False)
To fix it permanently, you may want to read:
http://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification
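As a sketch of the permanent route, you can point requests at an up-to-date CA bundle (this assumes the certifi package is installed), or at least silence the warning if you stay on verify=False:
import certifi
import requests
import urllib3

url = "https://stockx.com/sneakers"

# option 1: verify against a current CA bundle instead of disabling verification
html = requests.get(url, verify=certifi.where()).content

# option 2: keep verify=False but suppress the InsecureRequestWarning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
html = requests.get(url, verify=False).content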
I'm guessing the data you're looking for is at line 126 in the pastebin. I've never tried to extract the text of a script, but I'm sure it could be done.
In lxml, something like:
source_code.xpath('//script[@type="text/javascript"]')
should return a list of all the scripts as element objects.
Or, to try to get straight to the "tickers":
[i for i in source_code.xpath('//script[@type="text/javascript"]') if 'tickers' in i.xpath('string()')]
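I haven't inspected the page, so this is only a sketch: if the tickers sit in an embedded javascript assignment like var tickers = {...};, a regex-plus-json approach could pull them out (the variable name tickers is a guess):
import json
import re

script_texts = source_code.xpath('//script[@type="text/javascript"]/text()')
for text in script_texts:
    # non-greedy match is a simplification; it will break on nested braces
    match = re.search(r'tickers\s*=\s*(\{.*?\});', text, re.DOTALL)
    if match:
        data = json.loads(match.group(1))  # only works if the blob is strict JSON
        print(data)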