How to access the Cambridge dictionary website with Python? - python

Sorry, I'm a newbie. I need to access this website with Python: https://dictionary.cambridge.org
This is what I tried:
from urllib import request
url = 'https://dictionary.cambridge.org/dictionary/english/flower'
print(request.urlopen(url).read())
This is what I get:
File "D:\Anaconda\lib\http\client.py", line 275, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
RemoteDisconnected: Remote end closed connection without response
Can you share any ideas on how I can access this website?
Thanks a lot!

Solution
import urllib.request
url = 'https://dictionary.cambridge.org/dictionary/english/flower'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
headers = {'User-Agent': user_agent}
# headers must be passed by keyword; the second positional argument is data
req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    html = response.read()
print(html)
Explanation
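The connection is being closed because the request does not carry a browser-like User-Agent header. By default, urllib identifies itself as Python-urllib/x.y, which some servers reject.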
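To see what the explanation above means in practice, here is a small sketch that makes no network request. It only inspects the User-Agent that urllib would send by default and the one set on the request; the variable names are illustrative.

```python
import urllib.request

# Default headers that urllib attaches when none are supplied.
default_opener = urllib.request.build_opener()
default_agent = dict(default_opener.addheaders)["User-agent"]
print(default_agent)  # starts with "Python-urllib/"

# The same request with a browser-like User-Agent (string taken from the solution).
url = "https://dictionary.cambridge.org/dictionary/english/flower"
browser_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"
req = urllib.request.Request(url, headers={"User-Agent": browser_agent})

# Request normalises header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

A header set on the Request is used in place of the opener's default, so the server sees the browser-like string.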

The code below makes the same kind of request to Cambridge using requests and BeautifulSoup.
Note that the User-Agent string varies with the browser and its version. To find your own, open the Cambridge site, press F12 (on Windows) to open the developer tools, and copy the User-Agent from the request headers. Hope it helps.
from bs4 import BeautifulSoup
import requests
url = 'https://dictionary.cambridge.org/dictionary/french-english/bonjour'
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36"
headers = {'User-Agent': user_agent}
web_request = requests.get(url, headers=headers)
soup = BeautifulSoup(web_request.text, "html.parser")
# do something with soup

Related

Getting the same response different URL

I'm getting the same response from these 2 URLs:
First URL
Second URL
This is the code I'm using:
import requests
url = "https://www.amazon.it/blackfriday"
querystring = {"ref_":"nav_cs_gb_td_bf_dt_cr","deals-widget":"{\"version\":1,\"viewIndex\":60,\"presetId\":\"deals-collection-all-deals\",\"sorting\":\"BY_SCORE\"}"}
payload = ""
headers = {"cookie": "session-id=260-4643637-2647537; session-id-time=2082787201l; i18n-prefs=EUR; ubid-acbit=258-7747562-7485655; session-token=%22aZB70z2dnXHbhJ9e02ESp7q6xO23IGnDFT2iBCiPXZFoBTTEguAJ%2FBSnV7ud6bjAca64nh3bMF1bwDykOBf9BV%2BVjbx4tUQCyBkrg8tyR8PLZ8cjzpCz%2FzQSAmjiL6mSBcspkF8xuV0bxqLeRX7JQCMrHVBFf%2BsUhxV%2FMBLCH8UPk2o5aNL7OyAFCODBdRqm72RK5DAoKeMUymlVEOtqzvZSJbP%2Fut0gobiXJblRM2c%3D%22"}
response = requests.request("GET", url, data=payload, headers=headers, params=querystring)
I would like to get the same response that I get in the browser.
How can I do it? Why does this happen?
You have to trick the server into thinking you are a browser. You can accomplish this by setting the user agent header.
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36',"cookie": "session-id=260-4643637-2647537; session-id-time=2082787201l; i18n-prefs=EUR; ubid-acbit=258-7747562-7485655; session-token=%22aZB70z2dnXHbhJ9e02ESp7q6xO23IGnDFT2iBCiPXZFoBTTEguAJ%2FBSnV7ud6bjAca64nh3bMF1bwDykOBf9BV%2BVjbx4tUQCyBkrg8tyR8PLZ8cjzpCz%2FzQSAmjiL6mSBcspkF8xuV0bxqLeRX7JQCMrHVBFf%2BsUhxV%2FMBLCH8UPk2o5aNL7OyAFCODBdRqm72RK5DAoKeMUymlVEOtqzvZSJbP%2Fut0gobiXJblRM2c%3D%22"}

Connection timeouts as a protection from site scraping?

I am new to Python and web scraping, but for two weeks I have been periodically scraping one website and successfully downloading images from it. I use different proxies and sometimes change them. But starting yesterday, all my proxies suddenly stopped working with a timeout error. I've tried a whole list of them and they all fail.
Could this be a kind of site protection from scraping? If yes, is there a way to overcome it?
header = {
"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
proxies = {
"http": "http://188.114.99.153",
"https": "http://180.94.69.66:8080"
}
url = 'https://parovoz.com/newgallery/index.php?&LNG=RU&NO_ICONS=0&CATEG=-1&HOWMANY=192'
html = requests.get(url, headers=header, proxies=proxies, timeout=10).text
soup = BeautifulSoup(html, 'lxml')
Error message:
ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001536A8E7190>, 'Connection to 180.94.69.66 timed out. (connect timeout=10)'))
The code below GETs the URL and retries up to 3 times on a ConnectTimeoutError. The backoff_factor adds a delay between attempts, which helps avoid failing again if the site enforces a periodic request quota.
Take a look at urllib3.util.retry.Retry; it has many options that simplify retries.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
header = {
"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://parovoz.com/newgallery/index.php?&LNG=RU&NO_ICONS=0&CATEG=-1&HOWMANY=192'
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
html = session.get(url, headers=header).text
soup = BeautifulSoup(html, 'lxml')
print(soup)
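The retry-with-delay idea from this answer can also be written by hand. The sketch below is not how urllib3 implements Retry; it is a minimal stand-alone version, and fetch_with_retries, flaky_fetch, and the delay schedule are all assumptions made for illustration. It runs without network access because the request is simulated.

```python
import time

def fetch_with_retries(fetch, attempts=3, backoff=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out.

    Waits backoff, 2*backoff, 4*backoff, ... seconds between attempts.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(backoff * 2 ** attempt)

# Stand-in for a real request: fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated connect timeout")
    return "<html>ok</html>"

# Record the delays instead of actually sleeping.
delays = []
result = fetch_with_retries(flaky_fetch, attempts=3, backoff=0.5, sleep=delays.append)
print(result, delays)
```

With requests, the exceptions to catch would be the ones in requests.exceptions (for example ConnectionError and Timeout) rather than the built-ins used here.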

Response [412] when using the requests python package to access this webpage, how to get around it?

This is the reproducible code:
import requests
url = 'http://wjw.hubei.gov.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'}
res = requests.get(url,headers=headers)
print(res)
The code print(res) gives the following output:
<Response [412]>
I can open the webpage fine on my computer with Chrome.
Is there something missing in the header? Is there a way to get around the 412 error? Thanks in advance!
That website requires a valid Cookie before it will respond.
I've tried several approaches, such as calling the main website first and then reusing its Cookie under requests.Session(), but the website would not let me through.
So for now the only options are to use Selenium or to pass a valid Cookie to requests.
Here's how to get the Cookie and User-Agent via the browser:
Using the following code:
import requests
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
"Cookie": "Hm_lvt_5544783ae3e1427d6972d9e77268f25d=1578572654; Hm_lpvt_5544783ae3e1427d6972d9e77268f25d=1578572671; dataHide2=64fa0f2a-a6aa-43b4-adf0-ce901e8d1a37; FSSBBIl1UgzbN7N80S=sXE0qXcyGkTm4uVerLqfZyUU3XFMZzkm22k.eqVABLPe0eYMo3D8uX5ZJ07.7cCr; FSSBBIl1UgzbN7N80T=4aY.P74ZFvDef6i1BgsPAGpjsGOCcIHJFaOyshl4_fJ1WvTk1nqBkdG9PsyX3VRZcIuI8zdYiRJw4rEBQfx.Mv.GS_wT6Hzgiw.AY.UMP.Mw4iCKXGDzY1UeIH2gUd15impxzBVzZpN3MnSdqD0TUqcxSq0RrvIuE8RKT5pFLAqaNnVqtbeSACx43yIYtKJ41y8Isu6a6lNOlWNeaFJ8bx22pKm3lAIO.HIDhGSZqrUP76.q3i4Iux59f7dqJPuSRF90G1LSUBE8t8HrlWzBcSwJJJARX4Ioc0iHmHvdkVoigUitTRjLUHJM4ieOV1sLBDsq"
}
r = requests.get("http://wjw.hubei.gov.cn/", headers=headers)
print(r)
Output:
<Response [200]>
Update:
import requests
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"}
with requests.Session() as req:
    # Visit the main site first and reuse the cookie it sets
    r = req.get("http://www.hubei.gov.cn/")
    headers['Cookie'] = r.headers.get("Set-Cookie")
    for item in range(10):
        new = req.get("http://wjw.hubei.gov.cn/", headers=headers)
        print(new)
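For comparison, the same "visit first, then reuse the cookies" idea can be sketched with the standard library alone. This is an assumption-laden sketch, not a confirmed fix for this site: the fetch helper is hypothetical, and if the site sets its cookies through JavaScript, a cookie jar will not capture them, which may be why Selenium is suggested above.

```python
import http.cookiejar
import urllib.request

# The jar collects cookies from every response and replays them on later requests.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = [
    ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0")
]

def fetch(url):
    """Fetch url through the cookie-aware opener and return the status code."""
    with opener.open(url, timeout=10) as response:
        return response.status

# Usage (needs network access, so left commented out):
# fetch("http://www.hubei.gov.cn/")   # first visit fills the jar
# fetch("http://wjw.hubei.gov.cn/")   # later visits send the stored cookies
print(len(jar))  # no requests made yet, so the jar is empty
```

Note that opener.open raises urllib.error.HTTPError for a 412 response, so a real script would wrap the call in try/except.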
import requests
response=requests.get("https://precog.iiit.ac.in/")
< Response [200] >
<Response [400]>
<Response [800]>
None of the above responses

How can I use url lib using a json file

I'm trying to get data from a json link, but I'm getting this error: TypeError: can't concat str to bytes
This is my code:
l = "https://www.off---white.com/en/IT/men/products/omch016f18d471431088s"
url = (l+".json"+"?porcoiddio")
req = urllib.request.Request(url, headers)
response = urllib.request.urlopen(req)
size_opts = json.loads(response.decode('utf-8'))['available_sizes']
How can I solve this error?
The answer to your question is to change your code to:
size_opts = json.loads(response.read().decode('utf-8'))['available_sizes']
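The read-then-decode step can be tried without touching the network. In this sketch an io.BytesIO object stands in for the response returned by urlopen, and the JSON content is made up for illustration.

```python
import io
import json

# Stand-in for the object returned by urlopen(): a binary stream with .read().
# The JSON content is made up for illustration.
fake_response = io.BytesIO(b'{"available_sizes": ["S", "M", "L"]}')

raw = fake_response.read()    # bytes, as returned by response.read()
text = raw.decode("utf-8")    # str
size_opts = json.loads(text)["available_sizes"]
print(size_opts)
```

The response object itself has no decode method; only the bytes returned by read() do.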
Edit (2018-10-02 22:55): I looked at your source code and found that the server returns Response 503. The reason is that the request did not contain cookies:
req = urllib.request.Request(url, headers=headers)
You have to update your headers:
headers.update({"Cookie": cookie_value})
req = urllib.request.Request(url, headers=headers)  # the headers must include the cookies
You are providing the data argument by mistake.
You have to pass headers as a keyword argument; otherwise it is taken as the second positional parameter, which is data. Try this:
req = urllib.request.Request(url, headers=headers)
See https://docs.python.org/3/library/urllib.request.html#urllib.request.Request for documentation of the Request signature.
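The difference between the two calls can be checked without sending anything. In this sketch the URL and the header value are placeholders; only the Request objects are built and inspected.

```python
import urllib.request

url = "https://example.com/item.json"    # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder header value

wrong = urllib.request.Request(url, headers)          # dict lands in data
right = urllib.request.Request(url, headers=headers)  # dict lands in headers

# A request with data is sent as POST; one without data is sent as GET.
print(wrong.data, wrong.headers, wrong.get_method())
print(right.data, right.headers, right.get_method())
```

In the positional form the request carries no custom headers at all, and the dictionary is treated as a POST body.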
You could have a go using requests instead?
import requests, json
l = "https://www.off---white.com/en/IT/men/products/omch016f18d471431088s"
url = (l+".json"+"?porcoiddio")
session = requests.Session()
session.mount('http://', requests.adapters.HTTPAdapter(max_retries=10))
size_opts = session.get(url, headers= {'Referer': 'off---white.com/it/IT/login', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}).json()['available_sizes']
To check the response:
size_opts = session.get(url, headers= {'Referer': 'off---white.com/it/IT/login', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'})
print(size_opts)
Gives
<Response [503]>
This response means: "503 Service Unavailable. The server is currently unable to handle the request due to a temporary overload or scheduled maintenance"
I would suggest the problem isn't the code but the server?

Python3, beautifulsoup, return nothing in specific pages

On some pages, BeautifulSoup returns nothing, just a blank page.
from bs4 import BeautifulSoup
import urllib.request
Site = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
URL = Site
html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")
print(soup)
I can use BeautifulSoup on any other site except this one, and I don't know why.
This URL requires certain headers to be passed with the request.
Pass this headers parameter when requesting the URL and you will get the HTML.
HTML = requests.get(URL, headers=headers).content
where
headers = {
    "method": "GET",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
    "Host": "gall.dcinside.com",
    "Pragma": "no-cache",
    "Upgrade-Insecure-Requests": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
}
As far as I can see, this site uses cookies. You can see the headers in the browser's developer tools. You can get the cookie as follows:
import urllib.request
r = urllib.request.urlopen(URL)
ck = r.getheader('Set-Cookie')
Now you can create the header like this and send it with subsequent requests.
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Cookie": ck,
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}
req = urllib.request.Request(URL, headers=headers)
html = urllib.request.urlopen(req).read()
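A Set-Cookie value usually carries attributes such as Path alongside the name=value pair, while a Cookie request header normally holds only the pairs. The sketch below shows one way to strip the attributes with the standard library; the cookie name and value are made up, and it handles a single Set-Cookie header only.

```python
from http.cookies import SimpleCookie

# Example Set-Cookie value; the cookie name and value are made up.
set_cookie = "sessionid=abc123; Path=/; HttpOnly"

parsed = SimpleCookie()
parsed.load(set_cookie)

# Keep only name=value pairs; attributes such as Path are not sent back.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in parsed.items()
)
print(cookie_header)
```

The resulting string could then be used as the "Cookie" value in the headers dictionary above.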
Some web servers look for robot scripts trying to access their pages. One of the simpler ways of doing this is to check which User-Agent the client sends. Because you are using Python and not a web browser, something like the following is sent (this is the default for requests; urllib sends Python-urllib/x.y):
python-requests/2.18.4
When the server sees an agent it does not like, it returns nothing. To get around this, change the User-Agent string in your request. There are hundreds to choose from, as the agent string changes with each browser release. For example, see this list of Firefox User-Agent strings:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0
The trick is to try a few, and find one that the server is happy with. In your case, ONLY the header needs to be changed in order to get HTML to be returned from the website. In some cases, cookies will also need to be used.
The header can be easily changed by passing a dictionary. This could be done using requests as follows:
from bs4 import BeautifulSoup
import requests
url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}).content
soup = BeautifulSoup(html, "html.parser")
print(soup)
