I've been trying to print time data from this site: clockofeidolon.com. I found that the hour, minutes, and seconds are stored in <span class="big-x"> tags, and I have tried to get the data with this:
from bs4 import BeautifulSoup
from requests import Session
session = Session()
session.headers['user-agent'] = (
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
'66.0.3359.181 Safari/537.36'
)
url = 'https://clockofeidolon'
response = session.get(url=url)
data = response.text
soup = BeautifulSoup(data, "html.parser")
spans = soup.find('<span class="big')
print(data)
print([span.text for span in spans])
I keep getting authentication errors, though:
socket.gaierror: [Errno 11001] getaddrinfo failed
This error occurs because you are trying to access a URL that doesn't exist (https://clockofeidolon) or that Python can't reach.
Look at this question, which explains what that error means:
"getaddrinfo failed", what does that mean?
The host clockofeidolon did not resolve to an IP. You were probably looking for clockofeidolon.com.
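Once the host is fixed, a minimal corrected sketch of the fetch-and-parse step could look like this. Note that find() expects a tag name rather than raw HTML; since the exact big-* class names on the page are an assumption, the sketch matches any span whose class starts with "big-":
from bs4 import BeautifulSoup
from requests import Session

session = Session()
session.headers['user-agent'] = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
    '66.0.3359.181 Safari/537.36'
)

# Note the full hostname, including .com
url = 'https://clockofeidolon.com'
response = session.get(url=url)
soup = BeautifulSoup(response.text, 'html.parser')

# find_all() with a class filter collects every span whose class starts with "big-"
spans = soup.find_all('span', class_=lambda c: c and c.startswith('big-'))
print([span.text for span in spans])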
I am new to Python and web scraping, but for two weeks I have been periodically scraping one website and successfully downloading images from it. I use different proxies and sometimes change them, but starting yesterday all of my proxies suddenly stopped working with a timeout error. I've tried a whole list of them and they all fail.
Could this be a kind of site protection from scraping? If yes, is there a way to overcome it?
import requests
from bs4 import BeautifulSoup

header = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
proxies = {
    "http": "http://188.114.99.153",
    "https": "http://180.94.69.66:8080"
}
url = 'https://parovoz.com/newgallery/index.php?&LNG=RU&NO_ICONS=0&CATEG=-1&HOWMANY=192'
html = requests.get(url, headers=header, proxies=proxies, timeout=10).text
soup = BeautifulSoup(html, 'lxml')
Error message:
ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001536A8E7190>, 'Connection to 180.94.69.66 timed out. (connect timeout=10)'))
This will GET the URL and retry up to 3 times on ConnectTimeoutError, applying delays between attempts, which helps avoid failing again if the site enforces a periodic request quota.
Take a look at urllib3.util.retry.Retry, it has many options to simplify retries.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

header = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://parovoz.com/newgallery/index.php?&LNG=RU&NO_ICONS=0&CATEG=-1&HOWMANY=192'

session = requests.Session()
# Retry up to 3 times on connection errors, with an exponentially
# growing delay between attempts (controlled by backoff_factor)
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
# Route both http and https requests through the retrying adapter
session.mount('http://', adapter)
session.mount('https://', adapter)

html = session.get(url, headers=header).text
soup = BeautifulSoup(html, 'lxml')
print(soup)
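For example (a sketch of the extra options, not specific to this site), Retry can also retry on particular HTTP status codes and cap the total number of attempts. Note that allowed_methods requires urllib3 1.26 or newer; older versions call the same option method_whitelist.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=5,                  # overall cap on retries of all kinds
    connect=3,                # retries for connection failures
    backoff_factor=0.5,       # exponential delay between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # also retry on these statuses
    allowed_methods=["GET"],  # only retry idempotent requests
)
session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retry))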
I am trying to scrape data from a website, but it returns this error and I don't know how to fix it:
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'
This is my code
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
page = requests.get(url).content
page
The output is the Mod_Security "Not Acceptable!" response shown above.
You need to add a User-Agent header and it works.
If you do not send the User-Agent of a real browser, the site assumes you are a bot and blocks you.
from bs4 import BeautifulSoup
import requests
url = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}
page = requests.get(url, headers=headers).content
print(page)
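One optional refinement (my addition, not part of the original answer): check the status code so a block fails loudly instead of handing Mod_Security's error page to your parser. The Not Acceptable! page is typically served with HTTP 406:
import requests

url = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}

resp = requests.get(url, headers=headers)
resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx, e.g. 406 Not Acceptable
page = resp.content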
I am trying to access https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%2050. It works fine from my localhost (run from VS Code), but when I deploy it on the server I get an HTTP 499 error.
Did anybody get through this and manage to fetch the data with this approach?
Looks like NSE is blocking the request somehow. But then how is it working from localhost?
P.S. I am a paid user of the PythonAnywhere (Hacker) subscription.
import requests
import time

def marketDatafn(query):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'}
    # Hit the home page first so NSE sets the session cookies the API expects
    main_url = "https://www.nseindia.com/"
    session = requests.Session()
    response = session.get(main_url, headers=headers)
    cookies = response.cookies
    url = "https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%2050"
    nifty50DataReq = session.get(url, headers=headers, cookies=cookies, timeout=15)
    nifty50DataJson = nifty50DataReq.json()
    return nifty50DataJson['data']
Actually "Pythonanywhere" only supports those website which are in this whitelist.
And I have found that there are only two subdomain available under "nseindia.com", which is not that you are trying to request.
bricsonline.nseindia.com
bricsonlinereguat.nseindia.com
So PythonAnywhere is blocking your requests to that website.
Here's the link that explains how to request that your website be added to the whitelist.
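As an aside (my suggestion, not part of the original answer), a quick way to confirm it is the platform rather than NSE rejecting you is to run a bare request from the deployed environment and look at how it fails. The ProxyError branch assumes the platform routes traffic through its own proxy, as PythonAnywhere does for restricted accounts:
import requests

# Minimal connectivity probe to run from the deployment environment;
# a platform-side proxy/whitelist block and a server-side block fail differently
try:
    r = requests.get("https://www.nseindia.com/", timeout=10)
    print("reachable, status:", r.status_code)
except requests.exceptions.ProxyError as e:
    print("blocked by the platform proxy:", e)
except requests.exceptions.RequestException as e:
    print("request failed:", e)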
I had a script that would bypass a logon page that looks like this
URL="http://mywebsite.com/logon.aspx"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
username="username"
password="password"
s = Session()
s.verify = False
s.headers.update(headers)
r = s.get(URL)
soup=BeautifulSoup(r.content,"html.parser")
VIEWSTATE = soup.find(id="__VIEWSTATE")['value']
VIEWSTATEGENERATOR = soup.find(id="__VIEWSTATEGENERATOR")['value']
EVENTVALIDATION = soup.find(id="__EVENTVALIDATION")['value']
login_data={"__VIEWSTATE":VIEWSTATE,
"__VIEWSTATEGENERATOR":VIEWSTATEGENERATOR,
"__EVENTVALIDATION":EVENTVALIDATION,
"txtUsername":username,
"txtPassword":password,
"btnLogin":"Login"
}
#r = s.post(URL, data=login_data, verify=False)
r = s.post("http://mywebsite.com/logon.aspx", data=login_data)
r = s.get("http://mywebsite.com/SummaryReport/Index")
That script was working fine before, but then it started running into SSL errors, so I set verify=False on the session.
Now I don't get SSL errors, but it won't post the data to the logon page. I'm not sure if the two are related, but any help is much appreciated.
If this is the SSL error you are seeing, it's a warning and can be ignored:
/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'ownwebsite.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning,
What's failing is the following line:
VIEWSTATE = soup.find(id="__VIEWSTATE")['value']
The content received from the URL doesn't contain an element with id __VIEWSTATE, so find() returns None, and subscripting None with ['value'] raises the error:
TypeError: 'NoneType' object is not subscriptable
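A minimal defensive version of that lookup (my sketch, not part of the original answer) fails with a clear message instead of the bare TypeError; it is a drop-in replacement for that line in the script above:
# Look up the hidden field and fail loudly if the logon form wasn't returned
viewstate_tag = soup.find(id="__VIEWSTATE")
if viewstate_tag is None:
    raise RuntimeError("__VIEWSTATE not found; the response was probably not the logon page: " + r.url)
VIEWSTATE = viewstate_tag["value"]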
import urllib.request
from bs4 import BeautifulSoup
page = urllib.request.urlopen("https://www.google.com/search?sxsrf=ACYBGNTOhiadhX5wH-HLBzUmxJSBAPzpbQ%3A1574342044444&source=hp&ei=nI3WXbq4GMWGoASf-I2oAw&q=%EB%A6%AC%EB%B2%84%ED%92%80+&oq=%EB%A6%AC%EB%B2%84%ED%92%80+&gs_l=psy-ab.3..35i39j0l9.463.2481..2802...2.0..1.124.1086.0j10......0....1..gws-wiz.....10..0i131j0i10j35i362i39.ciJHtFLjhCA&ved=0ahUKEwi69r6SsfvlAhVFA4gKHR98AzUQ4dUDCAY&uact=5#sie=t;/m/04ltf;2;/m/02_tc;mt;fp;1;;").read()
soup = BeautifulSoup(page,'html.parser')
I'm trying to get a football game schedule from Google, and this error occurs. What's the reason?
rank = soup.find('table',{'class':'imspo_mt__mit'})
print(rank)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Google has blocked you from accessing the page; that's what the 403 error means.
Try spoofing a user agent? The following works for me:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
page = requests.get("https://www.google.com/search?sxsrf=ACYBGNTOhiadhX5wH-HLBzUmxJSBAPzpbQ%3A1574342044444&source=hp&ei=nI3WXbq4GMWGoASf-I2oAw&q=%EB%A6%AC%EB%B2%84%ED%92%80+&oq=%EB%A6%AC%EB%B2%84%ED%92%80+&gs_l=psy-ab.3..35i39j0l9.463.2481..2802...2.0..1.124.1086.0j10......0....1..gws-wiz.....10..0i131j0i10j35i362i39.ciJHtFLjhCA&ved=0ahUKEwi69r6SsfvlAhVFA4gKHR98AzUQ4dUDCAY&uact=5#sie=t;/m/04ltf;2;/m/02_tc;mt;fp;1;;", headers=user_agent)
soup = BeautifulSoup(page.text,'html.parser')
rank = soup.find('table',{'class':'imspo_mt__mit'})
print(rank)