Python - Extracting text from HTML element [duplicate]

So I've tried Selenium previously and now wanted to test out bs4. I tried running the following code but received None as output.
import bs4
import requests

res_pewdiepie = requests.get('https://www.youtube.com/user/PewDiePie')
soup = bs4.BeautifulSoup(res_pewdiepie.content, "lxml")
subs = soup.find(id="sub-count")
print(subs)
After researching for a while, I found out that requests doesn't load dynamic content like the sub count on YouTube or Socialblade. Is there a way to get this information with bs4, or do I have to switch back to something like Selenium?
Thanks in advance!

BeautifulSoup can only parse the text you give it, in this case the page source. If the information is not there, it can't do anything about it. So I believe you have to switch back to something that supports JavaScript.
Some options:
python-selenium
requests-html
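Of these, requests-html is the smaller change, since it keeps a requests-like API but can execute JavaScript through a bundled headless Chromium. A minimal sketch, reusing the "sub-count" id from the question; YouTube's markup changes often, so treat the selector as an assumption:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.youtube.com/user/PewDiePie')
# Execute the page's JavaScript (downloads Chromium on first run)
r.html.render(timeout=20)
# "sub-count" is taken from the question and may no longer exist
subs = r.html.find('#sub-count', first=True)
print(subs.text if subs else None)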

I use Splash for stuff like this. You can run it in a Docker container, and you can tweak how long it waits for rendering on a per-request basis. There's also a Scrapy plugin if you're doing any serious crawling. Here's a snippet from one of my crawlers, running Splash locally in Docker. Good luck.
import json
import requests

target_url = "https://somewhere.example.com/"
splash_url = "http://localhost:8050/render.json"
# Ask Splash to return the rendered HTML, waiting 10 seconds for JavaScript
body = json.dumps({"url": target_url, "har": 0, "html": 1, "wait": 10})
headers = {"Content-Type": "application/json"}
response = requests.post(splash_url, data=body, headers=headers)
result = json.loads(response.text)
html = result["html"]
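If you haven't used Splash before, the official image is scrapinghub/splash; "docker run -p 8050:8050 scrapinghub/splash" starts it on the port used above. The rendered HTML can then go straight into BeautifulSoup as in the question. A small sketch, again treating the "sub-count" id from the question as an assumption:
import bs4

soup = bs4.BeautifulSoup(html, "lxml")  # html from the Splash response above
subs = soup.find(id="sub-count")  # id from the question; may have changed
print(subs.get_text(strip=True) if subs else None)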

Related

Need help diagnosing Python Attribute error [duplicate]
Web Scrapping html that is commented out [duplicate]
empy list when using find all by class name [duplicate]
Why am I not seeing any kind of extracted data when I run this code [duplicate]

python requests cannot get html

I tried to get the HTML source of a site named dcinside in Korea. I am using requests but cannot get the HTML code. This is my code:
import requests

url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
req = requests.get(url)
print(req)
print(req.content)
but the result did not include the page's HTML. Why can't I get the HTML code even when using requests?
Most likely they are detecting that you are trying to crawl the data programmatically and are not returning any content in the response. Try pretending to be a browser by passing a User-Agent header.
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'youremail@domain.com'
}
response = requests.get(url, headers=headers)
# use an authentic Mozilla or Chrome User-Agent string if this doesn't work
Take a look at this:
Python Web Crawlers and "getting" html source code
As the answer in that post says, you can also use urllib2 to fetch web resources (note that urllib2 is Python 2 only; in Python 3 its functionality lives in urllib.request).
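For completeness, here's a minimal sketch of that approach in Python 3, where urllib.request replaces urllib2; the User-Agent string is a placeholder:
import urllib.request

url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
# Send a browser-like User-Agent, as with the requests example above
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8", errors="replace")
print(html[:200])  # print the start of the page as a sanity check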
