Web scraping a JSON file with requests

Web scraping a JSON file with requests - python

I'm trying to retrieve the timetable from this site using Requests.
I make the post sending the right parameters and get back the empty HTML skeleton, but instead I would like to get the json file returned.
Here is what I see when inspecting the page and highlighted you can see the file I want to retrieve.
Here is my code so far:
url = "https://alilauro-tickets.certusonline.com/"
headers = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
data = "msg=TimeTable&req=%7B%22getAvailability%22%3A%22Y%22%2C%22getBasicPrice%22%3A%22Y%22%2C%22getRouteAnalysis%22%3A%22Y%22%2C%22directOnly%22%3A%22Y%22%2C%22legs%22%3A%224%22%2C%22pax%22%3A1%2C%22origin%22%3A%22BEV%22%2C%22destination%22%3A%22ISC%22%2C%22tripRequest%22%3A%5B%7B%22tripfrom%22%3A%22BEV%22%2C%22tripto%22%3A%22ISC%22%2C%22tripdate%22%3A%222020-03-19%22%2C%22tripleg%22%3A0%7D%2C%7B%22tripfrom%22%3A%22ISC%22%2C%22tripto%22%3A%22BEV%22%2C%22tripdate%22%3A%222020-03-19%22%2C%22tripleg%22%3A1%7D%2C%7B%22tripfrom%22%3A%22BEV%22%2C%22tripto%22%3A%22FOR%22%2C%22tripdate%22%3A%222020-03-19%22%2C%22tripleg%22%3A2%7D%2C%7B%22tripfrom%22%3A%22FOR%22%2C%22tripto%22%3A%22BEV%22%2C%22tripdate%22%3A%222020-03-19%22%2C%22tripleg%22%3A3%7D%5D%7D"
r = requests.post(url, data=data, headers=headers, timeout=20)

The request should be as below:
url = 'https://alilauro-tickets.certusonline.com/php/proxy.php'
headers = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
data = {
'msg': 'TimeTable',
'req': '{"getAvailability":"Y","getBasicPrice":"Y","getRouteAnalysis":"Y","directOnly":"Y","legs":1,"pax":1,"origin":"BEV","destination":"FOR","tripRequest":[{"tripfrom":"BEV","tripto":"FOR","tripdate":"2020-03-18","tripleg":0}]}'
}
response = requests.post(url, headers=headers, data=data)

Related

Getting the same response different URL

I'm getting the same response from these 2 URLs:
First URL
Second URL
This is the code I'm using:
import requests
url = "https://www.amazon.it/blackfriday"
querystring = {"ref_":"nav_cs_gb_td_bf_dt_cr","deals-widget":"{\"version\":1,\"viewIndex\":60,\"presetId\":\"deals-collection-all-deals\",\"sorting\":\"BY_SCORE\"}"}
payload = ""
headers = {"cookie": "session-id=260-4643637-2647537; session-id-time=2082787201l; i18n-prefs=EUR; ubid-acbit=258-7747562-7485655; session-token=%22aZB70z2dnXHbhJ9e02ESp7q6xO23IGnDFT2iBCiPXZFoBTTEguAJ%2FBSnV7ud6bjAca64nh3bMF1bwDykOBf9BV%2BVjbx4tUQCyBkrg8tyR8PLZ8cjzpCz%2FzQSAmjiL6mSBcspkF8xuV0bxqLeRX7JQCMrHVBFf%2BsUhxV%2FMBLCH8UPk2o5aNL7OyAFCODBdRqm72RK5DAoKeMUymlVEOtqzvZSJbP%2Fut0gobiXJblRM2c%3D%22"}
response = requests.request("GET", url, data=payload, headers=headers, params=querystring)
I would like to get the same response that i get on the browser
How can i do it? Why does this happen?

You have to trick the server into thinking you are a browser. You can accomplish this by setting the user agent header.
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36',"cookie": "session-id=260-4643637-2647537; session-id-time=2082787201l; i18n-prefs=EUR; ubid-acbit=258-7747562-7485655; session-token=%22aZB70z2dnXHbhJ9e02ESp7q6xO23IGnDFT2iBCiPXZFoBTTEguAJ%2FBSnV7ud6bjAca64nh3bMF1bwDykOBf9BV%2BVjbx4tUQCyBkrg8tyR8PLZ8cjzpCz%2FzQSAmjiL6mSBcspkF8xuV0bxqLeRX7JQCMrHVBFf%2BsUhxV%2FMBLCH8UPk2o5aNL7OyAFCODBdRqm72RK5DAoKeMUymlVEOtqzvZSJbP%2Fut0gobiXJblRM2c%3D%22"}

How can I use url lib using a json file

I'm trying to get data from a json link, but I'm getting this error: TypeError: can't concat str to bytes
This is my code:
l = "https://www.off---white.com/en/IT/men/products/omch016f18d471431088s"
url = (l+".json"+"?porcoiddio")
req = urllib.request.Request(url, headers)
response = urllib.request.urlopen(req)
size_opts = json.loads(response.decode('utf-8'))['available_sizes']
How can I solve this error?

Your question answer is change your code to:
size_opts = json.loads(response.read().decode('utf-8'))['available_sizes']
Change at 2018-10-02 22:55 : I view your source code and found Response 503 , the reason why you got 503 is that request did not contain cookies:
req = urllib.request.Request(url, headers=headers)
you have update your headers.
headers.update({"Cookie":cookie_value})
req = urllib.request.Request(url, headers=headers) # !!!! you need a headers include cookies !!!!

you are providing the data argument by mistake …
you'll have to use a keyword argument for headers as otherwise the second argument will be filled with positional input, which happens to be data, try this:
req = urllib.request.Request(url, headers=headers)
See https://docs.python.org/3/library/urllib.request.html#urllib.request.Request for a documentation of Requests signature.

You could have a go using requests instead?
import requests, json
l = "https://www.off---white.com/en/IT/men/products/omch016f18d471431088s"
url = (l+".json"+"?porcoiddio")
session = requests.Session()
session.mount('http://', requests.adapters.HTTPAdapter(max_retries=10))
size_opts = session.get(url, headers= {'Referer': 'off---white.com/it/IT/login', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}).json()['available_sizes']
To check the response:
size_opts = session.get(url, headers= {'Referer': 'off---white.com/it/IT/login', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'})
print(size_opts)
Gives
<Response [503]>
This response means: "503 Service Unavailable. The server is currently unable to handle the request due to a temporary overload or scheduled maintenance"
I would suggest the problem isn't the code but the server?

Python Requests Get not Working

I have a simple Get request I'd like to make using Python's Request library.
import requests
HEADERS = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)'
'AppleWebKit/537.36 (KHTML, like Gecko)'
'Chrome/45.0.2454.101 Safari/537.36'),
'referer': 'http://stats.nba.com/scores/'}
url = 'http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500281&RangeType=2&Season=2016-17&SeasonType=Regular+Season&StartPeriod=1&StartRange=0'
response = requests.get(url, timeout=5, headers=HEADERS)
However, when I make the requests.get call, I get the error requests.exceptions.ReadTimeout: HTTPConnectionPool(host='stats.nba.com', port=80): Read timed out. (read timeout=5). But I am able to copy/paste that url into my browser and view the resulting JSON. Why is requests not able to get the result?

Your HEADERS format is wrong. I tried with this code and it worked without any issues:
import requests
HEADERS = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
}
url = 'http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500281&RangeType=2&Season=2016-17&SeasonType=Regular+Season&StartPeriod=1&StartRange=0'
response = requests.get(url, timeout=5, headers=HEADERS)
print(response.text)

Python requests: CSRF token missing or incorrect

I have a simple HTML page where I am trying to post form data using requests.post(); however, I keep getting Bad Request 400. CSRF token missing or incorrect even though I am passing it URL-encoded.
Please help.
url = "https://recruitment.advarisk.com/tests/scraping"
res = requests.get(url)
tree = etree.HTML(res.content)
csrf = tree.xpath('//input[#name="csrf_token"]/#value')[0]
postData = dict(csrf_token=csrf, ward=wardName)
print(postData)
postUrl = urllib.parse.quote(csrf)
formData = dict(csrf_token=postUrl, ward=wardName)
print(formData)
headers = {'referer': url, 'content-type': 'application/x-www-form-urlencoded', 'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
page = requests.post(url, data=formData, headers=headers)
return page.content

You have make sure the requests in one session, so that the csrf_token will be matched:
import sys
import requests
wardName = "DHANLAXMICOMPLEX"
url = 'https://recruitment.advarisk.com/tests/scraping'
#make the requests in one session
client = requests.session()
# Retrieve the CSRF token first
tree = etree.HTML(client.get(url).content)
csrf = tree.xpath('//input[#name="csrf_token"]/#value')[0]
#form data
formData = dict(csrf_token=csrf, ward=wardName)
headers = {'referer': url, 'content-type': 'application/x-www-form-urlencoded', 'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
#use same session client
r = client.post(url, data=formData, headers=headers)
print r.content
It will give you the html with the result data table.

Python scraping response data

i would like to take the response data about a specific website.
I have this site:
https://enjoy.eni.com/it/milano/map/
and if i open the browser debuger console i can see a posr request that give a json response:
how in python i can take this response by scraping the website?
Thanks

Apparently the webservice has a PHPSESSID validation so we need to get it first using proper user agent:
import requests
import json
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
}
r = requests.get('https://enjoy.eni.com/it/milano/map/', headers=headers)
session_id = r.cookies['PHPSESSID']
headers['Cookie'] = 'PHPSESSID={};'.format(session_id)
res = requests.post('https://enjoy.eni.com/ajax/retrieve_vehicles', headers=headers, allow_redirects=False)
json_obj = json.loads(res.content)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Web scraping a JSON file with requests - python

Related

Getting the same response different URL

How can I use url lib using a json file

Python Requests Get not Working

Python requests: CSRF token missing or incorrect

Python scraping response data

Categories

Resources