So I've tried Selenium previously and now wanted to test out bs4. I tried running the following code but received None as output.
import bs4
import requests

res_pewdiepie = requests.get(
    'https://www.youtube.com/user/PewDiePie')
soup = bs4.BeautifulSoup(res_pewdiepie.content, "lxml")
subs = soup.find(id="sub-count")
print(subs)  # prints None
After researching for a while, I found out that requests doesn't load dynamic content like the sub count on YouTube or Socialblade. Is there a way to get this information with bs4, or do I have to switch back to something like Selenium?
Thanks in advance!
BeautifulSoup can only parse the text you give it, in this case the raw page source. If the information isn't in that source, it can't do anything about it, so I believe you have to switch back to something that executes JavaScript.
Some options:
python-selenium
requests-html (see the sketch after this list)
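For example, here is a minimal sketch of the requests-html route. Two assumptions: render() downloads its bundled Chromium on first use, and YouTube still serves an element with the sub-count id.

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.youtube.com/user/PewDiePie')
r.html.render(sleep=2)  # runs Chromium to execute the page's JavaScript
subs = r.html.find('#sub-count', first=True)  # assumed selector, may have changed
print(subs.text if subs else None)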
I use Splash for stuff like this. You can run it in a Docker container, and you can tweak how long it waits for rendering on a per-request basis. There's also a Scrapy plugin if you're doing any serious crawling. Here's a snippet from one of my crawlers, running Splash locally in Docker. Good luck.
import json
import requests

target_url = "https://somewhere.example.com/"
splash_url = "http://localhost:8050/render.json"
# html=1 returns the rendered page, har=0 skips the HTTP-archive log,
# wait=10 gives the page ten seconds to finish its JavaScript
body = json.dumps({"url": target_url, "har": 0, "html": 1, "wait": 10})
headers = {"Content-Type": "application/json"}
response = requests.post(splash_url, data=body, headers=headers)
result = json.loads(response.text)
html = result["html"]
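From there you can hand the rendered markup to BeautifulSoup exactly as in the question's snippet. A minimal sketch; whether YouTube still exposes a sub-count id is an assumption:

import bs4

soup = bs4.BeautifulSoup(html, "lxml")  # html from the Splash response above
print(soup.find(id="sub-count"))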
I tried to get the HTML of a Korean site called dcinside. I am using requests but cannot get the HTML. This is my code:
import requests

url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
req = requests.get(url)
print(req)
print(req.content)
but the response contained no page content. Why can't I get the HTML even when using requests?
Most likely the site is detecting that you are crawling it programmatically and is not returning any content in the response. Try pretending to be a browser by sending a User-Agent header:
headers = {
    # use an authentic Mozilla or Chrome user-agent string if this doesn't work
    'User-Agent': 'My User Agent 1.0',
    # optional: an address the site can use to contact the crawler's operator
    'From': 'youremail@domain.com'
}
response = requests.get(url, headers=headers)
Take a look at this:
Python Web Crawlers and "getting" html source code
As that post suggests, you could use urllib2, which makes it easy to fetch web resources. Note that urllib2 is Python 2 only; on Python 3 the same functionality lives in urllib.request.
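For completeness, a minimal sketch of that approach, reusing the URL and User-Agent from above:

try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
req = Request(url, headers={'User-Agent': 'My User Agent 1.0'})
html = urlopen(req).read()
print(html[:200])  # first bytes of the page source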