Python scraping response data

I would like to capture the response data from a specific website.
I have this site:
https://enjoy.eni.com/it/milano/map/
If I open the browser debugger console, I can see a POST request that returns a JSON response.
How can I get this response in Python by scraping the website?
Thanks

Apparently the web service validates the PHPSESSID cookie, so we need to obtain it first by requesting the page with a proper User-Agent:
import requests
import json
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
}
r = requests.get('https://enjoy.eni.com/it/milano/map/', headers=headers)
session_id = r.cookies['PHPSESSID']
headers['Cookie'] = 'PHPSESSID={};'.format(session_id)
res = requests.post('https://enjoy.eni.com/ajax/retrieve_vehicles', headers=headers, allow_redirects=False)
json_obj = json.loads(res.content)
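
A minimal sketch of the same flow with a session object, so the PHPSESSID cookie is carried from the first request to the second automatically instead of being copied into a header by hand. The function name fetch_vehicles is hypothetical, and it accepts any object that behaves like requests.Session. Whether the endpoint still responds this way is an assumption taken from the answer above.

```python
MAP_URL = "https://enjoy.eni.com/it/milano/map/"
VEHICLES_URL = "https://enjoy.eni.com/ajax/retrieve_vehicles"
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36"
)


def fetch_vehicles(session):
    """Return the decoded JSON from the vehicles endpoint.

    `session` is expected to behave like requests.Session:
    it keeps cookies between calls and exposes headers, get and post.
    """
    session.headers.update({"User-Agent": USER_AGENT})
    # The first request makes the server set PHPSESSID; the session stores it.
    session.get(MAP_URL)
    # The second request sends the stored cookie back without manual handling.
    response = session.post(VEHICLES_URL, allow_redirects=False)
    response.raise_for_status()
    return response.json()
```

With the real library this would be called as fetch_vehicles(requests.Session()) after import requests.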

Related

Web scraping a JSON file with requests

I'm trying to retrieve the timetable from this site using Requests.
I send the POST with the right parameters, but I get back an empty HTML skeleton. I would like to get the returned JSON file instead.
Here is what I see when inspecting the page and highlighted you can see the file I want to retrieve.
Here is my code so far:
url = "https://alilauro-tickets.certusonline.com/"
headers = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
data = "msg=TimeTable&req=%7B%22getAvailability%22%3A%22Y%22%2C%22getBasicPrice%22%3A%22Y%22%2C%22getRouteAnalysis%22%3A%22Y%22%2C%22directOnly%22%3A%22Y%22%2C%22legs%22%3A%224%22%2C%22pax%22%3A1%2C%22origin%22%3A%22BEV%22%2C%22destination%22%3A%22ISC%22%2C%22tripRequest%22%3A%5B%7B%22tripfrom%22%3A%22BEV%22%2C%22tripto%22%3A%22ISC%22%2C%22tripdate%22%3A%222020-03-19%22%2C%22tripleg%22%3A0%7D%2C%7B%22tripfrom%22%3A%22ISC%22%2C%22tripto%22%3A%22BEV%22%2C%22tripdate%22%3A%222020-03-19%22%2C%22tripleg%22%3A1%7D%2C%7B%22tripfrom%22%3A%22BEV%22%2C%22tripto%22%3A%22FOR%22%2C%22tripdate%22%3A%222020-03-19%22%2C%22tripleg%22%3A2%7D%2C%7B%22tripfrom%22%3A%22FOR%22%2C%22tripto%22%3A%22BEV%22%2C%22tripdate%22%3A%222020-03-19%22%2C%22tripleg%22%3A3%7D%5D%7D"
r = requests.post(url, data=data, headers=headers, timeout=20)

The request should go to the php/proxy.php endpoint, with the form fields passed as a dictionary:
url = 'https://alilauro-tickets.certusonline.com/php/proxy.php'
headers = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
data = {
'msg': 'TimeTable',
'req': '{"getAvailability":"Y","getBasicPrice":"Y","getRouteAnalysis":"Y","directOnly":"Y","legs":1,"pax":1,"origin":"BEV","destination":"FOR","tripRequest":[{"tripfrom":"BEV","tripto":"FOR","tripdate":"2020-03-18","tripleg":0}]}'
}
response = requests.post(url, headers=headers, data=data)
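
As a sketch of how the req field could be built rather than typed by hand: the JSON string can be produced with json.dumps from a plain dictionary, and urlencode shows the form-encoded body that results. The helper name build_timetable_payload is hypothetical, and the field values are the ones from the answer above.

```python
import json
from urllib.parse import urlencode


def build_timetable_payload(origin, destination, trip_date):
    """Build the form fields for a one-leg TimeTable request."""
    request_body = {
        "getAvailability": "Y",
        "getBasicPrice": "Y",
        "getRouteAnalysis": "Y",
        "directOnly": "Y",
        "legs": 1,
        "pax": 1,
        "origin": origin,
        "destination": destination,
        "tripRequest": [
            {
                "tripfrom": origin,
                "tripto": destination,
                "tripdate": trip_date,
                "tripleg": 0,
            }
        ],
    }
    # Compact separators give the same spacing-free JSON as the browser request.
    return {
        "msg": "TimeTable",
        "req": json.dumps(request_body, separators=(",", ":")),
    }


payload = build_timetable_payload("BEV", "FOR", "2020-03-18")
# This is the body that a form-encoded POST of `payload` would carry.
encoded_body = urlencode(payload)
print(encoded_body)
```

The payload dictionary can be passed as data=payload to requests.post, which performs the same form encoding.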

Download bing image search results using python (custom url)

I want to download Bing image search results using Python code.
Example URL: https://www.bing.com/images/search?q=sketch%2520using%20iphone%2520students
My Python code generates a Bing search URL like the one in the example. The next step is to download all the images shown at that link to my local desktop.
In my project I generate some words in Python and my code builds a Bing image search URL from them. All I need is to download the images shown on that search page using Python.

To download an image, you need to make a request to the image URL itself, which typically ends with .png, .jpg, etc.
Bing stores the data you need as JSON in the "m" attribute of each <a> element. The image URL is under the "murl" key, so you can parse it out and download the image afterward.
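
To illustrate the "m" attribute idea in isolation, here is a small sketch that uses only the standard library's html.parser on a hand-written HTML fragment. The class name MurlExtractor and the sample markup are hypothetical; real Bing markup is larger and may differ.

```python
import json
from html.parser import HTMLParser


class MurlExtractor(HTMLParser):
    """Collect the "murl" value from every <a class="iusc" m="..."> element."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "iusc" in classes and attributes.get("m"):
            # The "m" attribute holds a JSON document; "murl" is the image URL.
            self.urls.append(json.loads(attributes["m"])["murl"])


# Hypothetical sample markup, shaped like the elements described above.
SAMPLE_HTML = """
<a class="iusc" m='{"murl": "https://example.com/one.jpg", "t": "first"}'></a>
<a class="other" m='{"murl": "https://example.com/skip.jpg"}'></a>
<a class="iusc extra" m='{"murl": "https://example.com/two.png"}'></a>
"""

parser = MurlExtractor()
parser.feed(SAMPLE_HTML)
image_urls = parser.urls
print(image_urls)
```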
To save all the images to your computer, you can use either of these two methods:
# bs4
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    image = requests.get(img_url, headers=headers, timeout=30)
    query = query.lower().replace(" ", "_")
    if image.status_code == 200:
        with open(f"images/{query}_image_{index}.jpg", 'wb') as file:
            file.write(image.content)
# urllib
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    query = query.lower().replace(" ", "_")
    opener = req.build_opener()
    opener.addheaders=[("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36")]
    req.install_opener(opener)
    req.urlretrieve(img_url, f"images/{query}_image_{index}.jpg")
In the first case, a context manager (with open(...)) writes the image bytes to a local file. In the second case, the urllib.request.urlretrieve function saves the URL straight to a file.
Also, make sure you send a User-Agent request header so the request looks like a "real" user visit. The default requests User-Agent is python-requests, so websites can tell that the request most likely comes from a script. Check what your User-Agent is.
Note: urllib.request.urlretrieve may fail when a request hits a CAPTCHA or anything else that returns an unsuccessful status code. The main drawback is that the response code is hard to inspect, whereas requests exposes a status_code attribute you can test.
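
If you want to stay with urllib but still see the status, one option is urllib.request.urlopen, which raises urllib.error.HTTPError for unsuccessful responses. The sketch below is an assumption about how that could be wrapped, not part of the original answer. The function name save_image and its opener parameter are hypothetical; opener exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
)


def save_image(url, path, opener=urllib.request.urlopen):
    """Download `url` to `path` and return the HTTP status code.

    On an HTTP error nothing is written and the error code is returned.
    """
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with opener(request, timeout=30) as response:
            data = response.read()
            status = response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses surface as exceptions that carry the code.
        return error.code
    with open(path, "wb") as file:
        file.write(data)
    return status
```

A caller can then skip or retry any image whose returned code is not 200.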
Full example using requests:
from bs4 import BeautifulSoup
import requests, lxml, json
query = "sketch using iphone students"
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": query,
"first": 1
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}
response = requests.get("https://www.bing.com/images/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "lxml")
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    image = requests.get(img_url, headers=headers, timeout=30)
    query = query.lower().replace(" ", "_")
    if image.status_code == 200:
        with open(f"images/{query}_image_{index}.jpg", 'wb') as file:
            file.write(image.content)
Full example using urllib.request.urlretrieve:
from bs4 import BeautifulSoup
import requests, lxml, json
import urllib.request as req
query = "sketch using iphone students"
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": query,
"first": 1
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}
response = requests.get("https://www.bing.com/images/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "lxml")
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    query = query.lower().replace(" ", "_")
    opener = req.build_opener()
    opener.addheaders=[("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36")]
    req.install_opener(opener)
    req.urlretrieve(img_url, f"images/{query}_image_{index}.jpg")

Edit your code to find the URL of the image you want, then download it with urllib.request:
import urllib.request as req
imgurl ="https://i.ytimg.com/vi/Ks-_Mh1QhMc/hqdefault.jpg"
req.urlretrieve(imgurl, "image_name.jpg")

Python Requests Get not Working

I have a simple GET request I'd like to make using Python's Requests library.
import requests
HEADERS = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)'
'AppleWebKit/537.36 (KHTML, like Gecko)'
'Chrome/45.0.2454.101 Safari/537.36'),
'referer': 'http://stats.nba.com/scores/'}
url = 'http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500281&RangeType=2&Season=2016-17&SeasonType=Regular+Season&StartPeriod=1&StartRange=0'
response = requests.get(url, timeout=5, headers=HEADERS)
However, when I make the requests.get call, I get the error requests.exceptions.ReadTimeout: HTTPConnectionPool(host='stats.nba.com', port=80): Read timed out. (read timeout=5). But I am able to copy/paste that url into my browser and view the resulting JSON. Why is requests not able to get the result?

Your HEADERS format is wrong. I tried with this code and it worked without any issues:
import requests
HEADERS = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
}
url = 'http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500281&RangeType=2&Season=2016-17&SeasonType=Regular+Season&StartPeriod=1&StartRange=0'
response = requests.get(url, timeout=5, headers=HEADERS)
print(response.text)
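
The answer does not say what exactly is wrong with the format. One visible difference is that the question builds the User-Agent from adjacent string literals, which Python joins without inserting spaces. The sketch below only demonstrates that behavior; whether it is what made the server stop responding is an assumption.

```python
# Adjacent string literals are concatenated as-is, with no separator added.
broken_user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)'
                     'AppleWebKit/537.36 (KHTML, like Gecko)'
                     'Chrome/45.0.2454.101 Safari/537.36')

# A trailing space inside each piece keeps the tokens separated.
fixed_user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                    'AppleWebKit/537.36 (KHTML, like Gecko) '
                    'Chrome/45.0.2454.101 Safari/537.36')

print(broken_user_agent)
print(fixed_user_agent)
```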

I get HTTPError: Not Found exception when I open the url

I would like to get the information on this page:
http://www.jnfdc.gov.cn/onsaling/viewhouse.shtml?fmid=757e06e0-c5b3-4384-9a14-2cb1eac011d1
From the browser debugger tools I can see that the information comes from this file:
http://www.jnfdc.gov.cn/r/house/757e06e0-c5b3-4384-9a14-2cb1eac011d1_154810896.xml
But when I open that URL directly in the browser, I can't get the file, and I don't know why.
I am using Python:
import urllib2
#url1 = 'http://www.jnfdc.gov.cn/onsaling/viewhouse.shtml?fmid=757e06e0-c5b3-4384-9a14-2cb1eac011d1'
url = 'http://www.jnfdc.gov.cn/r/house/757e06e0-c5b3-4384-9a14-2cb1eac011d1_113649432.xml'
headers = {
"Accept" :"*/*",
"Accept-Encoding" :"gzip, deflate, sdch",
"Accept-Language" :"zh-CN,zh;q=0.8",
"Cache-Control" :"max-age=0",
"Connection" :"keep-alive",
"Cookie" :"JSESSIONID=A205D8D7B0807FD34F879D6CB6EEB0CE",
"DNT" :"1",
"Host" :"www.jnfdc.gov.cn",
"Referer" :"http://www.jnfdc.gov.cn/onsaling/viewhouse.shtml?fmid=757e06e0-c5b3-4384-9a14-2cb1eac011d1",
"User-Agent" :"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.3051.400 QQBrowser/9.6.11301.400"
}
req = urllib2.Request(url, headers=headers)
resp = urllib2.urlopen(req) # this line raises HTTPError: Not Found
What should I do? Thanks.

To get the data the way the browser does, you can try Selenium (see the Selenium documentation).

How to connect to a private area using Python and requests

I am trying to log in to the member area of the following website:
https://trader.degiro.nl/
Unfortunately, I have tried many approaches without success.
The login form seems to send JSON, which is why I send JSON instead of form data:
import requests
session = requests.Session()
data = {"username":"test", "password":"test", "isRedirectToMobile": "false", "loginButtonUniversal": ""}
url = "https://trader.degiro.nl/login/#/login"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.62 Safari/537.36'}
r = session.post(url, headers=headers, json={'json_payload': data})
Does anyone have an idea why it doesn't work?

Looking at the request my browser sends, the code should be:
url = "https://trader.degiro.nl/login/secure/login"
...
r = session.post(url, headers=headers, json=data)
That is, there's no need to wrap the data in json_payload and the url is slightly different to the one for viewing the login page.
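
To see why the wrapper matters, the sketch below serializes both shapes with the standard json module. The json= argument of requests sends the object as a JSON request body, so the wrapped version nests every field one level deeper than the unwrapped one. The credentials are the placeholder values from the question.

```python
import json

data = {
    "username": "test",
    "password": "test",
    "isRedirectToMobile": "false",
    "loginButtonUniversal": "",
}

# Body produced by json={'json_payload': data}: fields sit under "json_payload".
wrapped_body = json.dumps({"json_payload": data})

# Body produced by json=data: fields sit at the top level.
flat_body = json.dumps(data)

print(wrapped_body)
print(flat_body)
```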
