I wanted to get the HTML of a website, but I can't, presumably because of the user agent: when I call uClient = ureq(my_url) I get this error: urllib.error.HTTPError: HTTP Error 403: Forbidden
This is the code:
from urllib.request import urlopen as ureq, Request
from bs4 import BeautifulSoup as soup
my_url= 'https://hsreplay.net/meta/#tab=matchups&sortBy=winrate'
ureq(Request(my_url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}))
uClient=ureq(my_url)
page_html=uClient.read()
uClient.close()
html=soup(page_html,"html.parser")
I have tried other ways of setting the user agent and other user agent strings, but none of them work.
I'm pretty sure you will help. Thanks!!
What you did above is a bit of a mess: you build a Request with the right headers but then never use it, so the second call, ureq(my_url), goes out without the User-Agent header and that is what triggers the 403. Try it this way instead.
from bs4 import BeautifulSoup
from urllib.request import Request,urlopen
URL = "https://hsreplay.net/meta/#tab=matchups&sortBy=winrate"
req = Request(URL, headers={"User-Agent": "Mozilla/5.0"})  # attach a browser-like User-Agent to the request
res = urlopen(req).read()
soup = BeautifulSoup(res,"lxml")
name = soup.find("h1").text
print(name)
Output:
HSReplay.net
By the way, you can scrape the few items on that page that are not rendered by JavaScript. However, the core content of the page is generated dynamically, so you can't grab it with urllib and BeautifulSoup alone. To get it you need a browser automation tool such as Selenium.
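For illustration, here is a minimal Selenium sketch of that route, assuming Selenium 4 with a local Chrome install; the tag it waits for is only a guess, since the page's markup may differ:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://hsreplay.net/meta/#tab=matchups&sortBy=winrate")
# wait for the JavaScript-rendered content instead of parsing the empty shell
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.TAG_NAME, "table")))
html = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()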
Related
I am trying to scrape data from a Twitter page using Python, but instead of getting the data back I keep getting "JavaScript is not available". I've enabled JavaScript in my browser (Chrome), but nothing changes.
Here is the error -->
<h1>JavaScript is not available.</h1>
<p>We’ve detected that JavaScript is disabled in this browser. Please enable JavaScript or switch to a supported browser to continue using twitter.com. You can see a list of supported browsers in our Help Center.</p>
Here is the code -->
from bs4 import BeautifulSoup
import requests
url = "https://twitter.com/search?q=%23developer%20advocate&src=typed_query&f=user"
source_code = requests.get(url).text
soup = BeautifulSoup(source_code, "lxml")
content = soup.find("div")
print(content)
I've tried enabling JavaScript in my browser (Chrome) and expected to get the required data back, but the "JavaScript is not available" error persists.
I would never advise scraping Twitter in violation of their policies; you should use the API instead! But as for the JavaScript message, just pass a user agent in the headers of your request.
from bs4 import BeautifulSoup
import requests
user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
headers = {'User-Agent': user_agent}
url = "https://twitter.com/search?q=%23developer%20advocate&src=typed_query&f=user"
source_code = requests.get(url, headers=headers).text
soup = BeautifulSoup(source_code, "lxml")
content = soup.find("div")
print(content)
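As a quick sanity check (a small sketch reusing the soup from above), you can confirm that the JavaScript banner from the error output is gone before parsing any further:
# if the banner is still present, the missing User-Agent was not the only problem
banner = soup.find("h1", string="JavaScript is not available.")
if banner is not None:
    print("Still getting the JavaScript-disabled page")
else:
    print("Got a regular response; continue parsing")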
I'm trying to code an Instagram web scraper in Python to return values like a person's followers, the number of posts, etc.
Let's just take Google's Instagram account for this example.
Here is my code:
import requests
from bs4 import BeautifulSoup
link = requests.get("https://www.instagram.com/google")
soup = BeautifulSoup(link.text, "html.parser")
print(soup)
print(link.status_code)
Pretty straightforward.
However, if I run the code, link.status_code is 429. It should be 200; for any other website it prints 200.
Also, when it prints soup, it doesn't show what I actually want: instead of the HTML for the account, it shows the HTML of the Instagram error page.
Why does requests open the Instagram error page rather than the account at the link provided?
To get a correct response from the server, set the User-Agent HTTP header:
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}
link = requests.get("https://www.instagram.com/google", headers=headers)
soup = BeautifulSoup(link.text, "lxml")
print(link.status_code)
print(soup.select_one('meta[name="description"]')["content"])
Prints:
200
12.5m Followers, 33 Following, 1,642 Posts - See Instagram photos and videos from Google (#google)
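If you then want the individual numbers (followers, following, posts), a follow-up sketch could parse that description string with a regular expression; the format assumed here is taken from the output above and may well change:
import re
description = soup.select_one('meta[name="description"]')["content"]
# e.g. "12.5m Followers, 33 Following, 1,642 Posts - ..."
match = re.search(r"([\d.,]+[km]?) Followers, ([\d.,]+) Following, ([\d.,]+) Posts", description)
if match:
    followers, following, posts = match.groups()
    print(followers, following, posts)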
I'm trying to automate a login using Python's requests module, but whenever I send a POST or GET request the server returns a 403 status code. The weird part is that I can access the same URL in any browser, but it just won't work with curl or requests.
Here is the code:
import requests
import lxml
from bs4 import BeautifulSoup
import os
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')
ready = soup.prettify()
FILE = open("usvisa.html", "w")
FILE.write(ready)
FILE.close()
I'd appreciate any help or ideas!
It's most likely the site's bot protection (not just /robots.txt) that's blocking you.
Try overriding the User-Agent with a custom one.
import requests
import lxml
from bs4 import BeautifulSoup
import os
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"}
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
req = requests.get(url, headers=headers).text
soup = BeautifulSoup(req, 'lxml')
ready = soup.prettify()
FILE = open("usvisa.html", "w", encoding="utf-8")
FILE.write(ready)
FILE.close()
You also didn't specify the file encoding when opening the file.
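As a small aside (purely a stylistic sketch), a with block keeps the encoding explicit and closes the file for you:
# equivalent to the open/write/close above, but the file is closed automatically
with open("usvisa.html", "w", encoding="utf-8") as f:
    f.write(ready)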
import requests
from bs4 import BeautifulSoup
user_agent = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
page = requests.get("https://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League", headers=user_agent)
soup = BeautifulSoup(page.text,'html.parser')
print(soup)
I'm trying to crawl 'whoscored.com' but I can't get all of the HTML. Can you tell me the solution? This is the result I get:
Request unsuccessful. Incapsula incident ID: 946001050011236585-61439481461474967
from selenium import webdriver
import time
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
url = 'https://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League'
browser.get(url)
time.sleep(3)
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
print(soup.prettify())
There are a couple of issues here. The root cause is that the website you are trying to scrape knows you're not a real person and is blocking you. Many websites do this simply by checking request headers to see whether a request is coming from a browser or from a robot. However, this site appears to use Incapsula, which is designed to provide more sophisticated protection.
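To see why the header check works, it helps to look at what requests sends by default (a small illustration; the exact version string will vary with your installation):
import requests
# the default User-Agent openly identifies the client as a script, e.g. "python-requests/2.28.1"
print(requests.utils.default_headers()["User-Agent"])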
I am currently trying to reproduce a web scraping example with Beautiful Soup. However, I have to say I find it pretty unintuitive, which of course might also be due to my lack of experience. If anyone could help me with an example, I'd appreciate it. I cannot find much relevant information online. I would like to extract the first value (Dornum) from the following website: http://flow.gassco.no/
I only got this far:
import requests
page = requests.get("http://flow.gassco.no/")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, 'html.parser')
Thank you in advance!
Another way is to use the requests module.
You can pass a user agent like this:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.114 Mobile Safari/537.36'
}
page = requests.get("http://flow.gassco.no/", headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
EDIT: To make this version work, you can work around the site's Terms and Conditions page with a session.
You need requests to send a cookie that tells the site the session has already accepted the Terms and Conditions.
Run this code:
import requests
from bs4 import BeautifulSoup
url = "http://flow.gassco.no"
s = requests.Session()
r = s.get(url)
action = BeautifulSoup(r.content, 'html.parser').find('form').get('action')  # the "tail" of the URL which indicates acceptance of the Terms
s.get(url+action)
page = s.get(url).content
soup = BeautifulSoup(page, 'html.parser')
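From there, a hedged follow-up for the original goal of finding "Dornum": this assumes the value appears as plain text in the HTML served after the Terms are accepted, which may not hold if it is rendered client-side:
dornum = soup.find(string="Dornum")
if dornum is not None:
    # print the surrounding element's text, which should include the value next to the label
    print(dornum.parent.get_text(strip=True))
else:
    print("Dornum not found in the static HTML; it may be rendered by JavaScript")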
You need to learn how to use urllib/urllib2 first.
Some websites block spiders.
Try something like this:
req = urllib2.Request('http://flow.gassco.no/')
req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.114 Mobile Safari/537.36')
This lets the website think you are a browser, not a robot.
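Since the rest of this thread uses Python 3 (where urllib2 no longer exists), a rough equivalent with urllib.request would be:
from urllib.request import Request, urlopen
req = Request('http://flow.gassco.no/')
req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.114 Mobile Safari/537.36')
html = urlopen(req).read()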