Crawling, can't connect to the site - python

I have a little issue, because when I want to crawle a site i get error like: "HTTP Error 404 not found" I tried some ways to fix it, but it didn't work. I can't connect with the site to get the data.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import urllib.request
my_url="https://tabletennis.setkacup.com/en/schedule?date=2021-08-29&hall=4&period=1"
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0'
headers={'User-Agent':user_agent,}
request = urllib.request.Request(my_url,None,headers)
uClient = uReq(request)

It seems like SSL Error if so look at here.
Or you can try requests library.
pip install requests
import requests
my_url="https://tabletennis.setkacup.com/en/schedule?date=2021-08-29&hall=4&period=1"
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0'
headers={'User-Agent':user_agent,}
response = requests.request(method="GET", url=my_url, headers=headers)
print(response.content)

Related

WebScraping A Website With Json Content Gives Value Error

I am trying to scrape an api call with requests. This is the website
Following Is The Error That It Gives Me:
ValueError: No JSON object could be decoded
Following Is The Code :
import requests
import json
import time
from bs4 import BeautifulSoup
url = 'https://www.nseindia.com/api/event-calendar'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
request = requests.get(url,headers=headers)
data = json.loads(request.text)
print(data)
How Can I Scrape This Website ?
Try this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.nseindia.com/companies-listing/corporate-filings-event-calendar'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
request = requests.get(url,headers=headers)
soup = BeautifulSoup(request.text,'html.parser')
print(soup)
The table is probably being dynamically generated with Javascript. Therefore, requests won't work. You need selenium and a headless browser to do that.

How to access the Cambridge dictionary website with Python?

Sorry, I'm a newbie. I need to access this website with Python https://dictionary.cambridge.org
This is what I try:
from urllib import *
url = 'https://dictionary.cambridge.org/dictionary/english/flower'
print (request.urlopen(url).read())
This is what I get:
File "D:\Anaconda\lib\http\client.py", line 275, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
RemoteDisconnected: Remote end closed connection without response
Can you share any ideas how I can access this website?
Thanks a lot!
Solution
url = 'https://dictionary.cambridge.org/dictionary/english/flower'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
headers = {'User-Agent': user_agent}
req = urllib.request.Request(url, headers)
with urllib.request.urlopen(req) as response:
html = response.read()
print(html)
Explanation
Currently the connection is being terminated because of no headers in the request.
The code below talks about the same way you can request to cambridge.
note that (user_agent) will vary depending on the version, you can go to cambridge then F12 on windows to get the corresponding user_agent, hope it helps you.
from bs4 import BeautifulSoup
import requests
url = 'https://dictionary.cambridge.org/dictionary/french-english/bonjour'
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36"
headers = {'User-Agent': user_agent}
web_request = requests.get(url, headers=headers)
soup = BeautifulSoup(web_request.text, "html.parser")
//do somthing

Response [412] when using the requests python package to access this webpage, how to get around it?

This is the reproducible code:
import requests
url = 'http://wjw.hubei.gov.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'}
res = requests.get(url,headers=headers)
print(res)
The code print(res) gives the following output:
<Response [412]>
I can open the webpage fine on my computer with Chrome.
Is there something missing in the header? Is there a way to get around the 412 error? Thanks in advance!
That website require a valid Cookie in order to response back to you.
I've tried several ways such as calling the main website and then retrieving the Cookie under requests.Session() but the website is not allowing me to pass through.
So the only way which you can use as for now. Or to use Selenium or pass a valid Cookie to the requests
Here's how to get the Cookie and User-Agent via the browser:
Using the following Code:
import requests
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
"Cookie": "Hm_lvt_5544783ae3e1427d6972d9e77268f25d=1578572654; Hm_lpvt_5544783ae3e1427d6972d9e77268f25d=1578572671; dataHide2=64fa0f2a-a6aa-43b4-adf0-ce901e8d1a37; FSSBBIl1UgzbN7N80S=sXE0qXcyGkTm4uVerLqfZyUU3XFMZzkm22k.eqVABLPe0eYMo3D8uX5ZJ07.7cCr; FSSBBIl1UgzbN7N80T=4aY.P74ZFvDef6i1BgsPAGpjsGOCcIHJFaOyshl4_fJ1WvTk1nqBkdG9PsyX3VRZcIuI8zdYiRJw4rEBQfx.Mv.GS_wT6Hzgiw.AY.UMP.Mw4iCKXGDzY1UeIH2gUd15impxzBVzZpN3MnSdqD0TUqcxSq0RrvIuE8RKT5pFLAqaNnVqtbeSACx43yIYtKJ41y8Isu6a6lNOlWNeaFJ8bx22pKm3lAIO.HIDhGSZqrUP76.q3i4Iux59f7dqJPuSRF90G1LSUBE8t8HrlWzBcSwJJJARX4Ioc0iHmHvdkVoigUitTRjLUHJM4ieOV1sLBDsq"
}
r = requests.get("http://wjw.hubei.gov.cn/", headers=headers)
print(r)
Output:
<Response [200]>
Update:
import requests
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"}
with requests.Session() as req:
r = req.get("http://www.hubei.gov.cn/")
headers['Cookie'] = r.headers.get("Set-Cookie")
for item in range(10):
new = req.get("http://wjw.hubei.gov.cn/", headers=headers)
print(new)
import requests
response=requests.get("https://precog.iiit.ac.in/")
< Response [200] >
<Response [400]>
<Response [800]>
None of the above responses

Getting 200 response but not logging in using python requests library

I have tried to scrape twitter account followers list. For that, authentication is required. So i used requests library for authentication purpose. The problem i am getting is, when i try to authenticate, I am getting 200 response but authentication is not done. The code is:
import requests
from bs4 import BeautifulSoup
import json
payload={
"session[username_or_email]":"*****************",
"session[password]":"**********************",
"authenticity_token":"aa3520020157738bdabb6d60f2e02894c6c85689",
"ui_metrics":'{"rf":{"a67dd0828000993f688a64a8238f647dd8ef987feb0db5979725fc7e304c3989":-250,"a4cd98aa5fd1d026bfded286fc24eb6ac9cf01a65b789ade51b68558cb0f6ae0":-21,"a88c7b5bdeb04ce3cf55df08c0f981f99df760b9348680c735fbff5b60ad054f":51,"a5e59c69fb04ab30f2f8468030c31ca1150f4265e4c2a35dbb1b67b85be6954f":-68},"s":"QdcvZJ9RhjLcVcW2N_pDt5j5AKQJCkqnh9caYV5ykW35tRpQc_RN5s_VefN2uVCONpXf-qZa-fr8VtCAFrtiOf2f6PhloU2GyxLDN38wGppFNWhb4psCr7x-kibioS9PDxWZF1pe3FM-MOz9YtIQrWxbmEAWnRTK3gUn-1nv4kTFDa159YxJoXiYt43g41sRUJWezJI2yJaECnO1ARbkNAPKrMndxRAcq_5qSFpT8CqzEUvBKPMdFMKeUrzeEecqmx632lTV1NlucVIvV9co3Y3Rk7CtURoaiCwsjTED1brU4XAY3VwsTEuNRUYZqirRNZrYQBCHqsMh5FV_UHpO2QAAAWE40pmN"}',
"scribe_log":"",
"redirect_after_login":"",
"authenticity_token":"aa3520020157738bdabb6d60f2e02894c6c85689",
"return_to_ssl":"",
"remember_me":"1",
"lang":"",
"redirect":""
}
headers={
"accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-encoding":"gzip, deflate, br",
"accept-language":"en-US,en;q=0.9",
"cache-control":"max-age=0",
"cookie":'moments_profile_moments_nav_tooltip_self=true; syndication_guest_id=v1%3A150345116906281638; eu_cn=1; kdt=QErLcBT9OjM5gjEznmsRcHlMTK6biDyAw4gfI5ro; _ga=GA1.2.1923324433.1496571570; tfw_exp=0; __utma=43838368.1923324433.1496571570.1516764481.1516764481.1; __utmz=43838368.1516764481.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); remember_checked_on=0; personalization_id="v1_Iq7dc3Mq746/e91mchhhJg=="; guest_id=v1%3A151698504007256847; ads_prefs="HBERAAA="; _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCF5%252F0jhhAToMY3NyZl9p%250AZCIlN2ZmZjExM2NkYjUzODEzZDNiNDE4YWI3NGRhZTAxOTc6B2lkIiU3YWFl%250AZjVhNDY1OWJlNzdiN2RiYjEzNjIwYWVjMGMyMQ%253D%253D--d69792331ec3a3b6c9d994a07f2159bfd5697089; ct0=ecc095f3a61b1c77279538584cb6f20e; _gid=GA1.2.253357133.1517076775; _gat=1',
"referer":"https://twitter.com/login",
"upgrade-insecure-requests":"1",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
str=payload["ui_metrics"]
x=json.dumps(str)
y=json.loads(str)
payload["ui_metrics"]=y
res = requests.post("https://twitter.com/login",data=payload,headers=headers)
r = requests.get("https://twitter.com/following")
soup = BeautifulSoup(r.text,"html.parser")
print(res.status_code)
print(r.url)
print(soup.prettify())
for item in soup.find_all({"class":"u-textInheritColor js-nav"}):
print(item.text)
I am getting 200 response for status code. How to solve this problem?
NOTE: I am not using any APIs. I want to authenticate using requests library.
Try this. It should get you there:
import requests
from bs4 import BeautifulSoup
with requests.Session() as s:
r = s.get("https://twitter.com/login")
soup = BeautifulSoup(r.text,"lxml")
token = soup.select_one("[name='authenticity_token']")['value']
payload={
'session[username_or_email]':'your_email',
'session[password]':'your_password',
'authenticity_token':token,
'ui_metrics':'{"rf":{"c6fc1daac14ef08ff96ef7aa26f8642a197bfaad9c65746a6592d55075ef01af":3,"a77e6e7ab2880be27e81075edd6cac9c0b749cc266e1cea17ffc9670a9698252":-1,"ad3dbab6c68043a1127defab5b7d37e45d17f56a6997186b3a08a27544b606e8":252,"ac2624a3b325d64286579b4a61dd242539a755a5a7fa508c44eb1c373257d569":-125},"s":"fTQyo6c8mP7d6L8Og_iS8ulzPObBOzl3Jxa2jRwmtbOBJSk4v8ClmBbF9njbZHRLZx0mTAUPsImZ4OnbZV95f-2gD6-03SZZ8buYdTDkwV-xItDu5lBVCQ_EAiv3F5EuTpVl7F52FTIykWowpNIzowvh_bhCM0_6ReTGj6990294mIKUFM_mPHCyZxkIUAtC3dVeYPXff92alrVFdrncrO8VnJHOlm9gnSwTLcbHvvpvC0rvtwapSbTja-cGxhxBdekFhcoFo8edCBiMB9pip-VoquZ-ddbQEbpuzE7xBhyk759yQyN4NmRFwdIjjedWYtFyOiy_XtGLp6zKvMjF8QAAAWE468LY"}',
'scribe_log':'',
'redirect_after_login':'',
'authenticity_token':token,
'remember_me':1
}
headers={
'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'content-type':'application/x-www-form-urlencoded',
'origin':'https://twitter.com',
'referer':'https://twitter.com/login',
'upgrade-insecure-requests':'1',
'user-agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
res = s.post("https://twitter.com/sessions",data=payload,headers=headers)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select(".tweet-text"):
print(item.text)

Python3, beautifulsoup, return nothing in specific pages

In some pages, when I use beautifulsoup, return nothing...just blank pages.
from bs4 import BeautifulSoup
import urllib.request
Site = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
URL = Site
html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")
print(soup)
I can use beautifulsoup any other site except this site. and I dont know way...
This URL will require certain headers passed while requesting.
Pass this headers parameter while requesting the URL and you will get the HTML.
HTML = requests.get(URL , headers = headers).content
while
headers = {
"method":"GET",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
"Host":"gall.dcinside.com",
"Pragma":"no-cache",
"Upgrade-Insecure-Requests":"1",
"Accept":"text/html,application/xhtml+xml,
application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
}
As I can see, this site is using cookies. You can see the headers in the browser's developer tool. You can get the cookie by following:
import urllib.request
r = urllib.request.urlopen(URL)
ck = r.getheader('Set-Cookie')
Now you can create the header like this and send it with subsequent requests.
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Cookie": ck,
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}
req = urllib.request.Request(URL, headers=headers)
html = urllib.request.urlopen(req).read()
Some website servers look for robot scripts trying to access their pages. One of the simpler methods of doing this is to check to see which User-Agent is being sent by the browser. In this case as you are using Python and not a web browser, the following is being sent:
python-requests/2.18.4
When it sees an agent it does not like, it will return nothing. To get around this, you need to change the User-Agent string in your request. There are hundreds to choose from, as the agent string changes with each release of a browser. For example see this list of Firefox User-Agent strings e.g.
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0
The trick is to try a few, and find one that the server is happy with. In your case, ONLY the header needs to be changed in order to get HTML to be returned from the website. In some cases, cookies will also need to be used.
The header can be easily changed by passing a dictionary. This could be done using requests as follows:
from bs4 import BeautifulSoup
import requests
url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}).content
soup = BeautifulSoup(html, "html.parser")
print(soup)

Categories

Resources