I need to scrape all the comments from an online newspaper article. Comments are loaded through an API:
https://api-graphql.lefigaro.fr/graphql?id=widget-comments_prod_commentsQuery2_a54719015f409774c55c77471444274cc461e7ee30bcbe03c8449e39ae15b16c&variables={%22id%22:%22bGVmaWdhcm8uZnJfXzYwMjdkODk4LTJjZWQtMTFlYi1hYmNlLTMyOGIwNDdhZjcwY19fQXJ0aWNsZQ==%22,%22page%22:2}
Therefore I am using requests to get its content:
import json
import requests

url = "https://api-graphql.lefigaro.fr/graphql?id=widget-comments_prod_commentsQuery2_a54719015f409774c55c77471444274cc461e7ee30bcbe03c8449e39ae15b16c"
params = "variables={%22id%22:%22bGVmaWdhcm8uZnJfXzYwMjdkODk4LTJjZWQtMTFlYi1hYmNlLTMyOGIwNDdhZjcwY19fQXJ0aWNsZQ==%22,%22page%22:2}"
response = requests.get(url, params=params).json()
print(json.dumps(response, indent=4))
What I need is a for loop that fetches every comment, since only 10 comments are returned at a time.
I can't find a way. I tried to use .format() with params like this:
params = "variables={%22id%22:%22bGVmaWdhcm8uZnJfXzYwMjdkODk4LTJjZWQtMTFlYi1hYmNlLTMyOGIwNDdhZjcwY19fQXJ0aWNsZQ==%22,%22page%22:{page_numb}}".format("page_numb":2)
But I get a SyntaxError.
str.format takes keyword arguments, not key: value pairs. Try "...{page_num}".format(page_num="2").
Also, since the { and } characters are part of the query payload, you'll have to escape them by doubling: { becomes {{ and } becomes }}. For example, "{hello {foo}}".format("foo": "world") becomes "{{hello {foo}}}".format(foo="world"). You'll also have to decode the URL-encoded string:
from urllib.parse import unquote
params = unquote("variables={{%22id%22:%22bGVmaWdhcm8uZnJfXzYwMjdkODk4LTJjZWQtMTFlYi1hYmNlLTMyOGIwNDdhZjcwY19fQXJ0aWNsZQ==%22,%22page%22:{page_numb}}}".format(page_numb=2))
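An alternative that sidesteps the brace escaping entirely, as a minimal sketch: build the variables object as a dict, serialize it with json.dumps, and let the URL encoder handle the percent-encoding. The helper name build_params and the page range are assumptions for illustration, and this assumes the API accepts the same JSON when characters such as ":" and "=" are percent-encoded.

```python
import json
from urllib.parse import urlencode

# Values copied from the URL in the question.
QUERY_ID = "widget-comments_prod_commentsQuery2_a54719015f409774c55c77471444274cc461e7ee30bcbe03c8449e39ae15b16c"
ARTICLE_ID = "bGVmaWdhcm8uZnJfXzYwMjdkODk4LTJjZWQtMTFlYi1hYmNlLTMyOGIwNDdhZjcwY19fQXJ0aWNsZQ=="


def build_params(page):
    """Return the query parameters for one page of comments (hypothetical helper)."""
    variables = {"id": ARTICLE_ID, "page": page}
    # Compact separators reproduce the {"id":"...","page":2} shape from the question.
    return {"id": QUERY_ID, "variables": json.dumps(variables, separators=(",", ":"))}


# The page range is an assumption; the real stopping condition depends on the API's response.
for page in range(1, 4):
    params = build_params(page)
    print(urlencode(params))
    # With requests installed, the same dict can be passed directly:
    # response = requests.get("https://api-graphql.lefigaro.fr/graphql", params=params).json()
```

Passing a dict means no manual escaping of braces or quotes is needed, and the page number stays an ordinary integer in the loop.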
I have been trying to figure out how to use python-requests to send a request whose URL looks like:
http://example.com/api/add.json?name='hello'&data[]='hello'&data[]='world'
Normally I can build a dictionary and do:
data = {'name': 'hello', 'data': 'world'}
response = requests.get('http://example.com/api/add.json', params=data)
That works fine for most everything that I do. However, I have hit the url structure from above, and I am not sure how to do that in python without manually building strings. I can do that, but would rather not.
Is there something in the requests library I am missing or some python feature I am unaware of?
Also what do you even call that type of parameter so I can better google it?
All you need to do is put the values in a list and use a list-like string ('data[]') as the key:
data = {'name': 'hello', 'data[]': ['hello', 'world']}
response = requests.get('http://example.com/api/add.json', params=data)
What you are doing is correct. The resulting URL is the one you expect.
>>> payload = {'name': 'hello', 'data': 'hello'}
>>> r = requests.get("http://example.com/api/params", params=payload)
You can see the resulting URL:
>>> print(r.url)
http://example.com/api/params?name=hello&data=hello
According to the URL encoding rules:
In particular, encoding the query string uses the following rules:
Letters (A–Z and a–z), numbers (0–9) and the characters .,-,~ and _ are left as-is
SPACE is encoded as + or %20
All other characters are encoded as %HH hex representation with any non-ASCII characters first encoded as UTF-8 (or other specified encoding)
So data[] will not appear as-is; the brackets are percent-encoded according to these rules.
If you try to build a URL like:
http://example.com/api/add.json?name='hello'&data[]='hello'&data[]='world'
the output will be:
>>> payload = {'name': 'hello', "data[]": 'hello','data[]':'world'}
>>> r = requests.get("http://example.com/api/params", params=payload)
>>> r.url
u'http://example.com/api/params?data%5B%5D=world&name=hello'
This is because a duplicate key in a Python dict is replaced by its last value, and data[] is percent-encoded as data%5B%5D.
If data%5B%5D is not a problem (that is, the server parses it correctly), you can go ahead with it.
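Both effects can be checked without sending a request. The following is a small sketch using only the standard library's urlencode and unquote (not requests itself), so the exact ordering requests produces may differ:

```python
from urllib.parse import urlencode, unquote

# A dict literal keeps only the last value for a repeated key.
payload = {"name": "hello", "data[]": "hello", "data[]": "world"}
print(payload)  # {'name': 'hello', 'data[]': 'world'}

# A list of (key, value) pairs preserves both values.
pairs = [("name", "hello"), ("data[]", "hello"), ("data[]", "world")]
encoded = urlencode(pairs)
print(encoded)  # name=hello&data%5B%5D=hello&data%5B%5D=world

# Percent-decoding restores the brackets, as a server would typically do on receipt.
print(unquote(encoded))  # name=hello&data[]=hello&data[]=world
```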
One solution, if using the requests module is not compulsory, is the urllib/urllib2 combination (Python 2):
payload = [('name', 'hello'), ('data[]', ('hello', 'world'))]
params = urllib.urlencode(payload, doseq=True)
sampleRequest = urllib2.Request('http://example.com/api/add.json?' + params)
response = urllib2.urlopen(sampleRequest)
It's a little more verbose and uses the doseq (sequence) trick to encode the URL parameters, but I used it before I knew about the requests module.
For the requests module, the answer provided by @Tomer should work.
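The urllib/urllib2 snippet above is Python 2 code. A rough Python 3 equivalent, assuming the same example URL, moves urlencode to urllib.parse and Request to urllib.request. The sketch below only builds the request and prints its URL; nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

payload = [("name", "hello"), ("data[]", ("hello", "world"))]

# doseq=True expands each sequence value into one key=value pair per element.
params = urlencode(payload, doseq=True)
sample_request = Request("http://example.com/api/add.json?" + params)
print(sample_request.full_url)

# Sending it would be: response = urllib.request.urlopen(sample_request)
```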
Some API servers expect a JSON array as a value in the URL query string. The params argument of requests doesn't create a JSON array as the value for a parameter.
The way I fixed this on a similar problem was to use urllib.parse.urlencode to encode the query string, append it to the URL and pass the result to requests,
e.g.
from urllib.parse import urlencode
import requests
# base_url, params and headers are placeholders assumed to be defined elsewhere
query_str = urlencode(params)
url = base_url + "?" + query_str
response = requests.get(url, headers=headers)
The solution is simply to use the urlencode function:
>>> import urllib.parse
>>> params = {'q': 'Python URL encoding', 'as_sitesearch': 'www.urlencoder.io'}
>>> urllib.parse.urlencode(params)
'q=Python+URL+encoding&as_sitesearch=www.urlencoder.io'
I have a request payload where I need to use a variable to get details for different products:
payload = f"{\"filter\":\"1202-1795\",\"bpId\":\"\",\"hashedAgentId\":\"\",\"defaultCurrencyISO\":\"pt-PT\",\"regionId\":2001,\"tenantId\":1,\"homeRegionCurrencyUID\":48}"
I need to change \"filter\":\"1202-1795\" to \"filter\":\"{variable}\" to populate the requests and get the info, but I'm struggling with the backslashes in the f-string.
I tried switching between double and single quotes, both inside the string and for the opening and closing quotes, and tried doubling the {} braces, but nothing works.
This is the variable I use to populate the request:
variable = ['1214-2291','1202-1823','1202-1795','1202-1742','1202-1719','1214-2000','1202-1198','1202-1090']
Create a dict, loop over items in the list, set filter key, make the request.
import requests

# url is the endpoint from the question (not shown there)
payload = {"bpId": "",
           "hashedAgentId": "",
           "defaultCurrencyISO": "pt-PT",
           "regionId": 2001,
           "tenantId": 1,
           "homeRegionCurrencyUID": 48}
items = ['1214-2291','1202-1823','1202-1795','1202-1742','1202-1719','1214-2000','1202-1198','1202-1090']
for item in items:
    # Set the filter for this product, then send the payload as a JSON body
    payload['filter'] = item
    response = requests.get(url, json=payload)
You can check the difference between data and json parameters in python requests package
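For comparison, here is a sketch of both ways to produce the JSON string from the question. The helper names build_payload and build_payload_fstring are made up for illustration. The f-string version works once the literal braces are doubled and the outer quotes are single quotes, so no backslashes are needed; json.dumps is usually less error-prone.

```python
import json

items = ["1214-2291", "1202-1823", "1202-1795"]


def build_payload(item):
    """Serialize the payload for one filter value with json.dumps (hypothetical helper)."""
    return json.dumps({
        "filter": item,
        "bpId": "",
        "hashedAgentId": "",
        "defaultCurrencyISO": "pt-PT",
        "regionId": 2001,
        "tenantId": 1,
        "homeRegionCurrencyUID": 48,
    })


def build_payload_fstring(item):
    """Same payload via an f-string: {{ and }} are literal braces, {item} is substituted."""
    return f'{{"filter":"{item}","bpId":"","hashedAgentId":"","defaultCurrencyISO":"pt-PT","regionId":2001,"tenantId":1,"homeRegionCurrencyUID":48}}'


for item in items:
    # Both strings decode to the same JSON object.
    assert json.loads(build_payload(item)) == json.loads(build_payload_fstring(item))
    print(build_payload(item))
```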
I need help splitting the parameter out of a URL when using Python requests.get.
Assuming I have this url
https://blabla.io/bla1/blas/blaall/messages?data=%7B%22limit_count%22%3A100%2C%22limit_size%22%3A1000%7D
and I called requests.get like this:
_get = requests.get("https://blabla.io/bla1/blas/blaall/messages?data=%7B%22limit_count%22%3A100%2C%22limit_size%22%3A1000%7D", headers={"Authorization":"MyToken 1234abcd"})
I checked _get.url and it returns:
u'https://blabla.io/bla1/blas/blaall/messages?data=%7B%22limit_count%22%3A100%2C%22limit_size%22%3A1000%7D'
Then I tried with the following to split the parameter
url = "https://blabla.io/bla1/blas/blaall/messages"
query = {"data[]":[{"limit_count":100, "limit_size":100}]}
headers = {"Authorization":"MyToken 1234abcd"}
_get = requests.get(url, params=query, headers=headers)
_get.url returns the following result:
u'https://blabla.io/bla1/blas/blaall/messages?data%5B%5D=limit_count&data%5B%5D=limit_size'
without 100 and 1000.
For this kind of URL (https://blabla.io/bla1/blas/blaall/messages?data=%7B%22limit_count%22%3A100%2C%22limit_size%22%3A1000%7D), how exactly do I split out its parameter?
Thank you for your help.
So you are looking for:
data={"limit_count":100,"limit_size":1000}
as your query params.
Unfortunately, requests will not flatten this nested structure; it treats any iterable value as multiple values for the key, so your nested dictionary is treated like:
query = {'data': ['limit_count', 'limit_size']}
Which is why you don't see 100 and 1000 in the end result.
You will need to flatten it into a string. You can use json.dumps() to create the required string (double quotes vs. single quotes, compact). Then requests will do the required URL encoding, e.g.:
In []:
import json
import requests
data = {'limit_count': 100, 'limit_size': 1000}
query = {'data': json.dumps(data, separators=(',', ':'))}
requests.get('http://httpbin.org', params=query).url
Out[]:
'http://httpbin.org/?data=%7B%22limit_count%22%3A100%2C%22limit_size%22%3A1000%7D'
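The same behaviour can be reproduced offline. This sketch uses the standard library's urlencode as a stand-in for the encoding step, so it illustrates the idea rather than the internals of requests:

```python
import json
from urllib.parse import urlencode

data = {"limit_count": 100, "limit_size": 1000}

# Iterating over a dict yields only its keys, which is why the values vanish
# when the nested dict is treated as a list of values.
print(list(data))  # ['limit_count', 'limit_size']

# Serializing the dict to a compact JSON string first keeps keys and values together.
query = {"data": json.dumps(data, separators=(",", ":"))}
encoded = urlencode(query)
print(encoded)  # data=%7B%22limit_count%22%3A100%2C%22limit_size%22%3A1000%7D
```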
from urllib.parse import urlsplit, parse_qs
import requests
url = "https://blabla.io/bla1/blas/blaall/messages?data=%7B%22limit_count%22%3A100%2C%22limit_size%22%3A1000%7D"
parts = urlsplit(url)
params = parse_qs(parts.query)  # decoded query parameters as a dict of lists
base_url = parts._replace(query="").geturl()  # URL without the query string, so it is not sent twice
headers = {"Authorization":"MyToken 1234abcd"}
_get = requests.get(base_url, params=params, headers=headers)
As far as I know, you can't use the requests library to parse URLs.
You use that to handle the requests. If you want a URL parser, use urllib.parse instead.
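To go one step further and recover the original values, the decoded data parameter can be passed through json.loads. A minimal sketch using the URL from the question:

```python
import json
from urllib.parse import urlsplit, parse_qs

url = "https://blabla.io/bla1/blas/blaall/messages?data=%7B%22limit_count%22%3A100%2C%22limit_size%22%3A1000%7D"

parts = urlsplit(url)
# parse_qs percent-decodes the query and returns each value as a list of strings.
params = parse_qs(parts.query)
print(params)  # {'data': ['{"limit_count":100,"limit_size":1000}']}

# The single "data" value is itself JSON, so it can be parsed into a dict.
data = json.loads(params["data"][0])
print(data)  # {'limit_count': 100, 'limit_size': 1000}

# The URL without its query string.
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
print(base_url)  # https://blabla.io/bla1/blas/blaall/messages
```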
I'm having problems getting data from an HTTP response. The format unfortunately comes back with '\n' attached to all the key/value pairs. JSON says it must be a str and not "bytes".
I have tried a number of fixes so my list of includes might look weird/redundant. Any suggestions would be appreciated.
#!/usr/bin/env python3
import urllib.request
from urllib.request import urlopen
import json
import requests
url = "http://finance.google.com/finance/info?client=ig&q=NASDAQ,AAPL"
response = urlopen(url)
content = response.read()
print(content)
data = json.loads(content)
info = data[0]
print(info)
#got this far - planning to extract "id:" "22144"
When it comes to making requests in Python, I personally like to use the requests library. I find it easier to use.
import json
import requests
r = requests.get('http://finance.google.com/finance/info?client=ig&q=NASDAQ,AAPL')
json_obj = json.loads(r.text[4:])
print(json_obj[0].get('id'))
The above solution prints: 22144
The response data had a couple unnecessary characters at the head, which is why I am only loading the relevant (json) portion of the response: r.text[4:]. This is the reason why you couldn't load it as json initially.
A bytes object has a decode() method, which converts bytes to a string. Checking the response in the browser, it seems there are some extra characters at the beginning of the string that need to be removed (a line feed character followed by two slashes: '\n//'). To skip the first three characters of the string returned by decode(), we add [3:] after the method call.
data = json.loads(content.decode()[3:])
print(data[0]['id'])
The output is exactly what you expect:
22144
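Hard-coding the slice length is fragile if the prefix ever changes. As a sketch of a slightly more defensive variant, the helper below (parse_prefixed_json, a made-up name) skips everything before the first "[", assuming the payload is a top-level JSON array as in this thread. The sample bytes are invented for illustration and only mimic the '\n//' prefix described above.

```python
import json


def parse_prefixed_json(raw):
    """Decode bytes and parse the JSON array that follows any leading junk (hypothetical helper)."""
    text = raw.decode("utf-8")
    # Assumes the payload is a JSON array; raises ValueError if no "[" is present.
    start = text.index("[")
    return json.loads(text[start:])


# Invented sample mimicking the '\n//' prefix; not a real API response.
sample = b'\n// [\n{\n"id": "22144"\n,"t" : "AAPL"\n}\n]\n'
data = parse_prefixed_json(sample)
print(data[0]["id"])  # 22144
```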
JSON says it must be a str and not "bytes".
Your content is bytes, so decode it to a str first, as below.
data = json.loads(content.decode())
I am using requests to create a post request on a contractor's API. I have a JSON variable inputJSON that undergoes formatting like so:
def dolayoutCalc(inputJSON):
    inputJSON = ast.literal_eval(inputJSON)
    inputJSON = json.dumps(inputJSON)
    url='http://xxyy.com/API'
    payload = {'Project': inputJSON, 'x':y, 'z':f}
    headers = {'content-type': 'application/json', 'Accept': 'text/plain'}
    r = requests.post(url, data=json.dumps(payload), headers=headers)
My issue arises when I define payload={'Project':inputJSON, 'x':y, 'z':f}
What ends up happening is Python places a pair of quotes around the inputJSON structure. The API I am hitting is not able to handle this. It needs Project value to be the exact same inputJSON value just without the quotes.
What can I do to prevent python from placing quotes around my inputJSON object? Or is there a way to use requests library to handle such POST request situation?
inputJSON gets quotes around it because it's a string. json.dumps() always returns a string, and when that string is serialized to JSON again as part of payload, it gets quotes around it, e.g.:
>>> import json
>>> json.dumps('this is a string')
'"this is a string"'
I'm with AKS in that you should be able to remove this line:
inputJSON = json.dumps(inputJSON)
From your description, inputJSON sounds like a Python literal (e.g. {'blah': True} instead of {"blah": true}). So you've used the ast module to convert it into a Python value, and then in the final json.dumps() it should be converted to JSON along with everything else.
Example:
>>> import ast
>>> import json
>>> input = "{'a_var': True}"  # A string that looks like a Python literal
>>> input = ast.literal_eval(input)  # Convert to a Python dict
>>> print(input)
{'a_var': True}
>>> payload = {'Project': input}  # Add to payload as a dict
>>> print(json.dumps(payload))  # In the payload as JSON without quotes
{"Project": {"a_var": true}}
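Put together as a function, the idea might look like the sketch below. The name build_body and the extra x/z values are placeholders standing in for the fields from the question; the point is that the payload is serialized exactly once.

```python
import ast
import json


def build_body(input_literal, **extra):
    """Build the JSON request body, serializing the whole payload once (hypothetical helper)."""
    # Convert the Python-literal string into a real dict; do not json.dumps it here.
    project = ast.literal_eval(input_literal)
    payload = {"Project": project, **extra}
    return json.dumps(payload)


body = build_body("{'a_var': True}", x=1, z=2)
print(body)  # {"Project": {"a_var": true}, "x": 1, "z": 2}

# The string can then be sent as in the question:
# requests.post(url, data=body, headers={'content-type': 'application/json'})
```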