Python crawler issue

I have a problem that I can't seem to solve myself; I hope someone here has another idea that can help me.
My plan is to crawl data from UN Comtrade for several countries and timeframes, but even my first call isn't working. The URL I want to send a GET request to is http://comtrade.un.org/api/get?&r=32&freq=A&ps=2013&px=H4&cc=AG6&type=C&rg=2&p=0&head=M and if I enter this URL in Postman I get a proper response with plenty of datasets, but if I try it from Python I get the response
"{'Message': 'Empty parameters or null values are not permitted. For more information please visit http://comtrade.un.org/data/doc/api/'}"
instead. The API doesn't require any authentication, and I didn't set any headers or make any other changes in Postman, but there it works.
Please take a look at my code and tell me what I'm doing wrong. Did I miss something?
You can try it yourself using the URL above up to 100 times per hour; maybe you'll find a way to make it work :)
My code:
import json
import requests
url = "http://comtrade.un.org/api/get?&r=32&freq=A&ps=2013&px=H4&cc=AG6&type=C&rg=2&p=0&head=M"
f = requests.get(url, timeout=300)
x = json.loads(f.text)
print(x)

The URL is malformed: replace the ?& with ?, so the correct URL becomes:
https://comtrade.un.org/api/get?r=32&freq=A&ps=2013&px=H4&cc=AG6&type=C&rg=2&p=0&head=M

import json
import requests
url = "http://comtrade.un.org/api/get?r=32&freq=A&ps=2013&px=H4&cc=AG6&type=C&rg=2&p=0&head=M"
f = requests.get(url, timeout=300)
x = json.loads(f.text)
print(x)
Hope it helps.
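As a further safeguard, you can let the library build the query string for you instead of hand-writing it, which makes a stray ?& impossible. A minimal sketch using the standard library; with requests you would pass the same dict directly as requests.get("http://comtrade.un.org/api/get", params=params):

```python
from urllib.parse import urlencode

# Parameters taken from the URL in the question.
params = {
    "r": 32, "freq": "A", "ps": 2013, "px": "H4",
    "cc": "AG6", "type": "C", "rg": 2, "p": 0, "head": "M",
}

# Building the query string programmatically avoids the stray "&" after "?".
url = "http://comtrade.un.org/api/get?" + urlencode(params)
print(url)
```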

Related

Parsing Post Requests in Python

I'm new to a lot of API work and I'm currently trying to get a bearer token from an API. After doing the proper POST request I'm able to get it, but in a format I'd rather not have to work around: I just get a big string of JSON data.
import requests
import os
import webbrowser
CLIENT_ID = [Client_ID]
CLIENT_SECRET = [Client_SECRET]
REDIRECT_URI = [REDIRECT_URI]
RESPONSE_TYPE = [RESPONSE_TYPE]
params = {
    "client_id": CLIENT_ID,
    "client_secret": CLIENT_SECRET,
    "redirect_uri": REDIRECT_URI,
    "response_type": RESPONSE_TYPE
}
endpoint = [this is the url endpoint]
webbrowser.open(endpoint)
code = input("Enter the Code: ")
print(code)
endpoint2 = "[endpoint without the code]" + code
token_endpoint = requests.post(endpoint2)
print(token_endpoint.text)
When executing this code and going through the steps I'm left with this:
{"access_token":"[bearer token here]","token_type":"Bearer","expires_in":7200,"refresh_token":"[refresh token here]","scope":"read","created_at":[time created at]}
This is a full string object since I'm reading it as "text." I can't get anything to print otherwise, but I'm willing to change that so the access_token value becomes its own variable that I can work with.
Any tips are appreciated. Thanks.
(note: sensitive information is just put into brackets)
You can use token_endpoint.json() to get a dictionary instead of text, and then select whichever key-value you want.
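For example, the token can be pulled out of the parsed response like this. The sketch below parses a sample body offline with json.loads, which is exactly what response.json() does for you (the field values are placeholders, not real credentials):

```python
import json

# Placeholder body shaped like the response in the question.
body = '{"access_token":"abc123","token_type":"Bearer","expires_in":7200}'

# With requests this is simply: token_data = token_endpoint.json()
token_data = json.loads(body)
access_token = token_data["access_token"]
print(access_token)  # -> abc123
```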
If you ever get stuck on anything requests-based, your first stop should be its excellent documentation; most of the time it will answer your question. Case in point: there is a section relevant to yours.
I've never needed to use webbrowser. I think response = requests.post(url, data=payload_dict) and response.json() will provide everything you need (cf. the requests documentation).
For extracting specific elements via xpath this answer might be helpful: parse html response

JSON_PARSING_ERROR: Unexpected character (r) at position 0

I've been trying for a couple of days to send a notification via FCM using the Python requests package. But I've been struggling with the same issue over and over and I can't figure out what's wrong with my code.
Here's the JSON that I'm trying to send to Firebase:
{"registration_ids":["A token given by Firebase"],"notification": {"title":"1","body":"I'm a test message"}}
I might have missed something, but as far as I know, the JSON message is well formatted. I've tried with both notification and message, but to no avail.
Here's the full code I'm using to do this:
import requests
URL = 'https://fcm.googleapis.com/fcm/send'
data = {"registration_ids":["A token from Firebase"],"notification": {"title":"1","body":"I'm a test message"}}
headers = {"Authorization":"key=My server key","Content-Type":"application/json"}
print(data)
r = requests.post(url=URL, data=data, headers=headers)
print(r.text)
It should return a message with a correct status, but instead it returns a 400 with JSON_PARSING_ERROR: Unexpected character (r) at position 0.
I'm not completely sure if I'm doing anything wrong. Thanks in advance!
If you want to send data as JSON, you need to actually produce that JSON:
import json
data = json.dumps(data)
requests.post(<...>, data=data)
or use post()'s json argument:
requests.post(<...>, json=data)
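You can see the difference without touching FCM at all: passing a plain dict via data= makes requests form-encode the body, so the body starts with the characters of the first key name, which is exactly the 'r' (from registration_ids) the error complains about. A small offline sketch (urlencode approximates requests' form encoding):

```python
import json
from urllib.parse import urlencode

data = {"registration_ids": ["some-token"],
        "notification": {"title": "1", "body": "I'm a test message"}}

# Roughly what requests sends with data=data: a form-encoded body.
form_body = urlencode(data)
print(form_body[0])   # 'r' -- the character FCM's JSON parser chokes on

# What FCM expects: a JSON document, as produced by json.dumps (or json=data).
json_body = json.dumps(data)
print(json_body[0])   # '{'
```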

How to crawl dynamic web with api url returning null?

I have a task to crawl all Pulitzer Winner, and I found this page has all I want: https://www.pulitzer.org/prize-winners-by-year/2018.
But I got the following problems,
Problem 1: How to crawl a dynamic page? I use Python's urllib2.urlopen to get the page's content, but since the page is rendered dynamically, this doesn't return the real content.
Problem 2: I then found an API URL in the dev tools: https://www.pulitzer.org/cache/api/1/winners/year/166/raw.json. But when I send a GET request from urllib2.urlopen, I always get null. Why does this happen? And how can I handle it?
If this is too naive for you, please name some words so that I can learn it from Google.
Thanks in advance!
One way to handle this is to create a session using the requests module. This way it passes the session details (cookies) required for the next API call; you also have to pass a Referer header, which tells the API which year you are looking for.
import requests
s = requests.session()
url = "https://www.pulitzer.org/prize-winners-by-year/2017"
resp1 = s.get(url)
headers = {'Referer': 'https://www.pulitzer.org/prize-winners-by-year/2017'}
api = "https://www.pulitzer.org/cache/api/1/winners/year/166/raw.json"
data = s.get(api, headers=headers)
Now you can extract the data from the JSON response in data (e.g. via data.json()).
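Assuming the endpoint returns JSON, data.json() turns the body straight into Python objects. The sketch below does the same parse offline with a made-up payload; the field names are assumptions, and the real keys come from the raw.json response:

```python
import json

# Hypothetical payload shaped like a winners list (field names are assumptions).
payload = '[{"title": "Example Winner", "category": "Fiction"}]'

# With the session above this would simply be: winners = data.json()
winners = json.loads(payload)
print(winners[0]["title"])  # -> Example Winner
```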

swagger API - create a perfect request

I have to work with an API that uses Swagger UI, and I have the API documentation, but it seems like something is missing.
I make my request like this:
from bs4 import BeautifulSoup as bs
import requests

url = "https://something.net/some-other-part/api/devices/"
response = requests.get(url, verify=False, auth=("xy", "xy"))
soup = bs(response.text, "html.parser")
I get back responses with code 200, so that part is OK. BUT...
The response doesn't include everything I need, although the response class says there is a lot of information I could expect to get with the response.
Without knowing the response class of the API, can you tell me how to extend the URL or the request (with parameters, for example) to get more, or more specific, data?
Later on, I found that the question was really not on point and solved the task on my own.
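For anyone landing here with the same question: Swagger UI lists each operation's query parameters, and with requests you can pass them as requests.get(url, params={...}) rather than editing the URL by hand. A stdlib sketch of the resulting query string, with entirely hypothetical parameter names (check the Swagger docs for the real ones):

```python
from urllib.parse import urlencode

base = "https://something.net/some-other-part/api/devices/"

# Hypothetical parameters -- not from the actual API documentation.
params = {"limit": 50, "offset": 0}

url = base + "?" + urlencode(params)
print(url)
```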

HTTP Error 401: Authorization Required while downloading a file from HTTPS website and saving it

Basically I need a program that, given a URL, downloads a file and saves it. I know this should be easy, but there are a couple of drawbacks here...
First, it is part of a tool I'm building at work. I have everything else besides this, and the URL is HTTPS; it's one of those URLs you would paste into your browser and get a pop-up asking whether you want to open or save the file (.txt).
Second, I'm a beginner at this, so if there's info I'm not providing, please ask me. :)
I'm using Python 3.3, by the way.
I tried this:
import urllib.request
response = urllib.request.urlopen('https://websitewithfile.com')
txt = response.read()
print(txt)
And I get:
urllib.error.HTTPError: HTTP Error 401: Authorization Required
Any ideas? Thanks!!
You can do this easily with the requests library.
import requests
response = requests.get('https://websitewithfile.com/text.txt',verify=False, auth=('user', 'pass'))
print(response.text)
To save the file you would type:
with open('filename.txt','w') as fout:
    fout.write(response.text)
(I would suggest you always set verify=True in the requests.get() call.)
Doesn't the browser also ask you to sign in? Then you need to repeat the request with the added authentication like this:
Python urllib2, basic HTTP authentication, and tr.im
Equally good: Python, HTTPS GET with basic authentication
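Since the question uses Python 3.3's urllib, here is a stdlib-only sketch of basic authentication in the spirit of those linked answers (no requests needed); the URL and credentials are placeholders from the question:

```python
import urllib.request

url = "https://websitewithfile.com/file.txt"  # placeholder URL

# Register the credentials for this URL, then build an opener that sends them.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, "user", "pass")  # placeholder credentials
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(handler)

# opener.open(url) performs the authenticated request; it would hit the
# network, so it is left commented out here:
# with opener.open(url) as response:
#     with open("filename.txt", "w") as fout:
#         fout.write(response.read().decode())
```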
If you don't have the requests module, the code below works for Python 2.6 or later (in Python 3 the equivalent class is urllib.request.URLopener, though it is deprecated):
import urllib
testfile = urllib.URLopener()
testfile.retrieve("https://randomsite.com/file.gz", "/local/path/to/download/file")
You can try this solution: https://github.qualcomm.com/graphics-infra/urllib-siteminder
import siteminder
import getpass
import pandas as pd  # only needed for read_html below

url = 'https://XYZ.dns.com'
r = siteminder.urlopen(url, getpass.getuser(), getpass.getpass(), "dns.com")  # prompts: Password: <enter your password>
data = r.read()
# or: tables = pd.read_html(r.read())
