How can I use url lib using a json file - python

I'm trying to get data from a json link, but I'm getting this error: TypeError: can't concat str to bytes
This is my code:
l = "https://www.off---white.com/en/IT/men/products/omch016f18d471431088s"
url = (l+".json"+"?porcoiddio")
req = urllib.request.Request(url, headers)
response = urllib.request.urlopen(req)
size_opts = json.loads(response.decode('utf-8'))['available_sizes']
How can I solve this error?

Your question answer is change your code to:
size_opts = json.loads(response.read().decode('utf-8'))['available_sizes']
Change at 2018-10-02 22:55 : I view your source code and found Response 503 , the reason why you got 503 is that request did not contain cookies:
req = urllib.request.Request(url, headers=headers)
you have update your headers.
headers.update({"Cookie":cookie_value})
req = urllib.request.Request(url, headers=headers) # !!!! you need a headers include cookies !!!!

you are providing the data argument by mistake …
you'll have to use a keyword argument for headers as otherwise the second argument will be filled with positional input, which happens to be data, try this:
req = urllib.request.Request(url, headers=headers)
See https://docs.python.org/3/library/urllib.request.html#urllib.request.Request for a documentation of Requests signature.

You could have a go using requests instead?
import requests, json
l = "https://www.off---white.com/en/IT/men/products/omch016f18d471431088s"
url = (l+".json"+"?porcoiddio")
session = requests.Session()
session.mount('http://', requests.adapters.HTTPAdapter(max_retries=10))
size_opts = session.get(url, headers= {'Referer': 'off---white.com/it/IT/login', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}).json()['available_sizes']
To check the response:
size_opts = session.get(url, headers= {'Referer': 'off---white.com/it/IT/login', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'})
print(size_opts)
Gives
<Response [503]>
This response means: "503 Service Unavailable. The server is currently unable to handle the request due to a temporary overload or scheduled maintenance"
I would suggest the problem isn't the code but the server?

Related

Getting the same response different URL

I'm getting the same response from these 2 URLs:
First URL
Second URL
This is the code I'm using:
import requests
url = "https://www.amazon.it/blackfriday"
querystring = {"ref_":"nav_cs_gb_td_bf_dt_cr","deals-widget":"{\"version\":1,\"viewIndex\":60,\"presetId\":\"deals-collection-all-deals\",\"sorting\":\"BY_SCORE\"}"}
payload = ""
headers = {"cookie": "session-id=260-4643637-2647537; session-id-time=2082787201l; i18n-prefs=EUR; ubid-acbit=258-7747562-7485655; session-token=%22aZB70z2dnXHbhJ9e02ESp7q6xO23IGnDFT2iBCiPXZFoBTTEguAJ%2FBSnV7ud6bjAca64nh3bMF1bwDykOBf9BV%2BVjbx4tUQCyBkrg8tyR8PLZ8cjzpCz%2FzQSAmjiL6mSBcspkF8xuV0bxqLeRX7JQCMrHVBFf%2BsUhxV%2FMBLCH8UPk2o5aNL7OyAFCODBdRqm72RK5DAoKeMUymlVEOtqzvZSJbP%2Fut0gobiXJblRM2c%3D%22"}
response = requests.request("GET", url, data=payload, headers=headers, params=querystring)
I would like to get the same response that i get on the browser
How can i do it? Why does this happen?
You have to trick the server into thinking you are a browser. You can accomplish this by setting the user agent header.
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36',"cookie": "session-id=260-4643637-2647537; session-id-time=2082787201l; i18n-prefs=EUR; ubid-acbit=258-7747562-7485655; session-token=%22aZB70z2dnXHbhJ9e02ESp7q6xO23IGnDFT2iBCiPXZFoBTTEguAJ%2FBSnV7ud6bjAca64nh3bMF1bwDykOBf9BV%2BVjbx4tUQCyBkrg8tyR8PLZ8cjzpCz%2FzQSAmjiL6mSBcspkF8xuV0bxqLeRX7JQCMrHVBFf%2BsUhxV%2FMBLCH8UPk2o5aNL7OyAFCODBdRqm72RK5DAoKeMUymlVEOtqzvZSJbP%2Fut0gobiXJblRM2c%3D%22"}

can't find the right compression for this webpage (python requests.get)

I can load this webpage in Google Chrome, but I can't access it via requests. Any idea what the compression problem is?
Code:
import requests
url = r'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'
headers = {'Accept-Encoding':'gzip, deflate, compress, br, identity'}
r = requests.get(url, headers=headers)
Result:
ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
Use a user agent that emulates a browser:
import requests
url = r'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
r = requests.get(url, headers=headers)
You're getting a 403 Forbidden error, which you can see using requests.head. Use RJ's suggestion to defeat huffpost's robot blocking.
>>> requests.head(url)
<Response [403]>

How to access the Cambridge dictionary website with Python?

Sorry, I'm a newbie. I need to access this website with Python https://dictionary.cambridge.org
This is what I try:
from urllib import *
url = 'https://dictionary.cambridge.org/dictionary/english/flower'
print (request.urlopen(url).read())
This is what I get:
File "D:\Anaconda\lib\http\client.py", line 275, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
RemoteDisconnected: Remote end closed connection without response
Can you share any ideas how I can access this website?
Thanks a lot!
Solution
url = 'https://dictionary.cambridge.org/dictionary/english/flower'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
headers = {'User-Agent': user_agent}
req = urllib.request.Request(url, headers)
with urllib.request.urlopen(req) as response:
html = response.read()
print(html)
Explanation
Currently the connection is being terminated because of no headers in the request.
The code below talks about the same way you can request to cambridge.
note that (user_agent) will vary depending on the version, you can go to cambridge then F12 on windows to get the corresponding user_agent, hope it helps you.
from bs4 import BeautifulSoup
import requests
url = 'https://dictionary.cambridge.org/dictionary/french-english/bonjour'
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36"
headers = {'User-Agent': user_agent}
web_request = requests.get(url, headers=headers)
soup = BeautifulSoup(web_request.text, "html.parser")
//do somthing

Web scraping a JSON file with requests

I'm trying to retrieve the timetable from this site using Requests.
I make the post sending the right parameters and get back the empty HTML skeleton, but instead I would like to get the json file returned.
Here is what I see when inspecting the page and highlighted you can see the file I want to retrieve.
Here is my code so far:
url = "https://alilauro-tickets.certusonline.com/"
headers = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
data = "msg=TimeTable&req=%7B%22getAvailability%22%3A%22Y%22%2C%22getBasicPrice%22%3A%22Y%22%2C%22getRouteAnalysis%22%3A%22Y%22%2C%22directOnly%22%3A%22Y%22%2C%22legs%22%3A%224%22%2C%22pax%22%3A1%2C%22origin%22%3A%22BEV%22%2C%22destination%22%3A%22ISC%22%2C%22tripRequest%22%3A%5B%7B%22tripfrom%22%3A%22BEV%22%2C%22tripto%22%3A%22ISC%22%2C%22tripdate%22%3A%222020-03-19%22%2C%22tripleg%22%3A0%7D%2C%7B%22tripfrom%22%3A%22ISC%22%2C%22tripto%22%3A%22BEV%22%2C%22tripdate%22%3A%222020-03-19%22%2C%22tripleg%22%3A1%7D%2C%7B%22tripfrom%22%3A%22BEV%22%2C%22tripto%22%3A%22FOR%22%2C%22tripdate%22%3A%222020-03-19%22%2C%22tripleg%22%3A2%7D%2C%7B%22tripfrom%22%3A%22FOR%22%2C%22tripto%22%3A%22BEV%22%2C%22tripdate%22%3A%222020-03-19%22%2C%22tripleg%22%3A3%7D%5D%7D"
r = requests.post(url, data=data, headers=headers, timeout=20)
The request should be as below:
url = 'https://alilauro-tickets.certusonline.com/php/proxy.php'
headers = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
data = {
'msg': 'TimeTable',
'req': '{"getAvailability":"Y","getBasicPrice":"Y","getRouteAnalysis":"Y","directOnly":"Y","legs":1,"pax":1,"origin":"BEV","destination":"FOR","tripRequest":[{"tripfrom":"BEV","tripto":"FOR","tripdate":"2020-03-18","tripleg":0}]}'
}
response = requests.post(url, headers=headers, data=data)

Python requests: CSRF token missing or incorrect

I have a simple HTML page where I am trying to post form data using requests.post(); however, I keep getting Bad Request 400. CSRF token missing or incorrect even though I am passing it URL-encoded.
Please help.
url = "https://recruitment.advarisk.com/tests/scraping"
res = requests.get(url)
tree = etree.HTML(res.content)
csrf = tree.xpath('//input[#name="csrf_token"]/#value')[0]
postData = dict(csrf_token=csrf, ward=wardName)
print(postData)
postUrl = urllib.parse.quote(csrf)
formData = dict(csrf_token=postUrl, ward=wardName)
print(formData)
headers = {'referer': url, 'content-type': 'application/x-www-form-urlencoded', 'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
page = requests.post(url, data=formData, headers=headers)
return page.content
You have make sure the requests in one session, so that the csrf_token will be matched:
import sys
import requests
wardName = "DHANLAXMICOMPLEX"
url = 'https://recruitment.advarisk.com/tests/scraping'
#make the requests in one session
client = requests.session()
# Retrieve the CSRF token first
tree = etree.HTML(client.get(url).content)
csrf = tree.xpath('//input[#name="csrf_token"]/#value')[0]
#form data
formData = dict(csrf_token=csrf, ward=wardName)
headers = {'referer': url, 'content-type': 'application/x-www-form-urlencoded', 'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
#use same session client
r = client.post(url, data=formData, headers=headers)
print r.content
It will give you the html with the result data table.

Categories

Resources