Error formatting the body of a scrapy request - python

When I make a scrapy request without formatting the body I get the right results; however, when I format it in order to loop over it I get a 400 error.
This is the body that's not formatted:
'{"fields":"id,angellist_url,job_roles","limit":25,"offset":0,"form_data":{"must":{"filters":{"founding_or_hq_slug_locations":{"values":["spain"],"execution":"or"}},"execution":"and"},"should":{"filters":{}},"must_not":{"growth_stages":["mature"],"company_type":["service provider","government nonprofit"],"tags":["outside tech"],"company_status":["closed"]}},"keyword":null,"sort":"-last_funding_date"}'
This is the formatted body:
'{"fields":"id,angellist_url,job_roles","limit":25,"offset":{offset_items},"form_data":{"must":{"filters":{"founding_or_hq_slug_locations":{"values":["spain"],"execution":"or"}},"execution":"and"},"should":{"filters":{}},"must_not":{"growth_stages":["mature"],"company_type":["service provider","government nonprofit"],"tags":["outside tech"],"company_status":["closed"]}},"keyword":null,"sort":"-last_funding_date"}'
Then when making the request I use:
yield scrapy.Request(url=url, headers=headers, body=body.format(offset_items='0'))

The two are identical except for **offset_items** in your 'formatted' string, which is what's causing the issue. The endpoint expects an int for the offset.
You're trying to pass the offset value into the body with .format(), but the syntax is wrong: in a string full of literal JSON braces, .format() treats each { as the start of a placeholder, so every literal brace would need to be doubled ({{ and }}). An f-string has the same literal-brace problem, so the most robust fix is to build the payload as a dict and serialize it, as sketched below.
You can use something like difflib in Python to quickly diff the two strings, which makes issues like this easy to troubleshoot.
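To make that concrete, here is a minimal sketch (not the asker's exact code; url and headers are assumed to exist elsewhere in the spider) that builds the payload as a dict and serializes it with json.dumps, so the literal JSON braces never collide with format placeholders:

import json

def build_body(offset_items):
    # Build the payload as a dict and serialize it afterwards,
    # so no brace-escaping is ever needed.
    payload = {
        "fields": "id,angellist_url,job_roles",
        "limit": 25,
        "offset": offset_items,  # stays an int in the serialized JSON
        "form_data": {
            "must": {
                "filters": {
                    "founding_or_hq_slug_locations": {
                        "values": ["spain"],
                        "execution": "or",
                    }
                },
                "execution": "and",
            },
            "should": {"filters": {}},
            "must_not": {
                "growth_stages": ["mature"],
                "company_type": ["service provider", "government nonprofit"],
                "tags": ["outside tech"],
                "company_status": ["closed"],
            },
        },
        "keyword": None,
        "sort": "-last_funding_date",
    }
    return json.dumps(payload)

# Inside the spider's loop (url and headers defined as in the question):
# yield scrapy.Request(url=url, headers=headers, body=build_body(offset))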

Input variable name as raw string into request in python

I am very new to Python.
I am trying to loop through a URL request via Python, changing one variable each time it loops.
My code looks something like this:
codes = ["MCDNDF3", "MCDNDF4"]
#count = 0
for x in codes:
    response = requests.get(url_part1 + str(codes) + url_part3, headers=headers)
    print(response.content)
    print(response.status_code)
    print(response.url)
I want the URL to change on every loop, first to url_part1+code+url_part3 and then to url_part1+NEXTcode+url_part3.
Sadly my request mangles the string from the variable into "%5B'MCDNDF3'%5D".
It should be inserted as a plain string on each loop. I don't know if I need URL encoding, as I don't have any special chars in the request; it should just change the code to MCDNDF3, and in the next request to MCDNDF4.
Any thoughts?
Thanks!
In your for loop, the first line should be:
response = requests.get(url_part1 + x + url_part3, headers=headers)
This will work assuming url_part1 and url_part3 are regular strings. x is already a string, as your codes list (at least in your example) contains only strings. %5B and %5D are [ and ] URL-encoded, respectively. You got that output because you called str() on a single-element list:
>>> str(["This is a string"])
"['This is a string']"
If url_part1 and url_part3 are raw strings, as you seem to indicate, please update your question to show how they are defined. Feel free to use example.com if you don't want to reveal your actual target URL. (Note that a raw-string literal like r"..." is still an ordinary str at runtime, so it needs no conversion before being concatenated.)
You’re putting in the whole list (codes) when you probably want x.
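Putting both answers together, a minimal corrected loop might look like this (url_part1 and url_part3 are placeholders here, since the question doesn't show them):

import requests

url_part1 = "https://example.com/lookup?code="  # placeholder
url_part3 = "&format=json"                      # placeholder
codes = ["MCDNDF3", "MCDNDF4"]

for x in codes:
    # x is already a string; str(codes) would stringify the whole list,
    # which is what produced %5B'MCDNDF3'%5D in the URL.
    response = requests.get(url_part1 + x + url_part3)
    print(response.status_code)
    print(response.url)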

retrieved URLs, trouble building payload to use requests module

I'm a Python novice, thanks for your patience.
I retrieved a web page, using the requests module. I used Beautiful Soup to harvest a few hundred href objects (links). I used uritools to create an array of full URLs for the target pages I want to download.
I don't want everybody who reads this note to bombard the web server with requests, so I'll show a hypothetical example that is realistic for just 2 hrefs. The array looks like this:
hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
          'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']
If I were typing these into 100s of lines of code, I understand what to do in order to retrieve each page:
from lxml import html
import requests
url = 'http://ku.edu/pls/WP040/'
payload = {'PT001F01': '910', 'pf7331': '11'}
r = requests.get(url, params = payload)
Then get the second page
payload = {'PT001F01': '910', 'pf7331': '12'}
r = requests.get(url, params = payload)
And keep typing in payload objects. Not all of the hrefs I'm dealing with are sequential, and not all of the payloads differ only in the last integer.
I want to automate this and I don't see how to create the payloads from the hrefs2 array.
While fiddling with uritools, I find urisplit which can give me the part I need to parse into a payload:
[urisplit(x)[3] for x in hrefs2]
['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
Each one of those has to be turned into a payload object and I don't understand what to do.
I'm using Python3 and I used uritools because that appears to be the standards-compliant replacement of urltools.
I fell back on shell script to get pages with wget, which does work, but it is so un-Python-ish that I'm asking here for what to do. I mean, this does work:
import subprocess
for i in hrefs2:
    subprocess.call(["wget", i])
You can pass the full url to requests.get() without splitting up the parameters.
>>> requests.get('http://ku.edu/pls/WP040?PT001F01=910&pf7331=12')
<Response [200]>
If for some reason you don't want to do that, you'll need to split up the parameters somehow. I'm sure there are better ways to do it, but the first thing that comes to mind is:
a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# list to store all url parameters after they're converted to dicts
urldata = []

# iterate over the list of params
for param in a:
    data = {}
    # split the string into key=value pairs
    for kv in param.split('&'):
        # split each pair up
        b = kv.split('=')
        # the first part is the key, the second is the value
        data[b[0]] = b[1]
    # after converting every kv pair in the parameter, add the result to the list
    urldata.append(data)
You could do this with less code, but I wanted to be clear about what was going on. I'm sure there is already a module out there that does this for you, too.
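For the record, the standard library already covers this: urllib.parse.parse_qsl splits a query string into key/value pairs, so the manual loop above collapses to one line. A sketch:

from urllib.parse import parse_qsl

a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# dict(parse_qsl(...)) turns each query string into a payload dict
urldata = [dict(parse_qsl(qs)) for qs in a]
print(urldata)
# [{'PT001F01': '910', 'pf7331': '11'}, {'PT001F01': '910', 'pf7331': '12'}]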

Params request python querystring

I need to make a request with a query string, like this: ?where[id]=CXP. But when I define it in the requests params (params = {'where[id]': 'CXP'}), the request returns an internal server error (500).
r = requests.get('http://myurl', params=params)
What is the correct way to make this request?
Thanks.
Friends, I think I solved it.
params accepts a plain string in the request.
I don't think it's the best way, but I'm passing it as:
r = requests.get('http://myurl', params='where[service]=CXP')
And it works fine!
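For anyone comparing the two forms, here is a small sketch (using httpbin.org as a stand-in URL, since http://myurl is a placeholder) of how requests treats a dict versus a preformatted string:

import requests

# Dict form: requests percent-encodes the brackets in the key
r1 = requests.get('https://httpbin.org/get', params={'where[id]': 'CXP'})
print(r1.url)  # https://httpbin.org/get?where%5Bid%5D=CXP

# String form: the query string is passed through with brackets intact
r2 = requests.get('https://httpbin.org/get', params='where[id]=CXP')
print(r2.url)  # https://httpbin.org/get?where[id]=CXP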
It is very much possible that the server requires JSON formatting. Even though a 500 doesn't look like that, it is still worth a try.
From requests version 2.4.2 onwards, you can simply pass json as a parameter; it will take care of the encoding for you. All you need to do is:
r = requests.get('http://myurl', json=params)
If you are (for whatever reason) using an older version of requests, you can serialize the payload yourself:
import simplejson as json
r = requests.get('http://myurl', data=json.dumps(params))
I always recommend simplejson over the json package, as it is updated more frequently.
EDIT: If you want to prevent URL encoding, your only solution might be to pass a string:
r = requests.get('http://myurl', params='where[id]=CXP')

Python facebook api about the limit and summary

https://developers.facebook.com/docs/workplace/custom-integrations/examples
there is an example, download.py; in the function def getFeed(group, name)
there is this piece:
params = "?fields=permalink_url,from,story,type,message,link,created_time,updated_time,likes.limit(0).summary(total_count),comments.limit(0).summary(total_count)"
What do likes.limit(0).summary(total_count) and comments.limit(0).summary(total_count) mean?
Specifically, what does limit(0) mean, and what does summary(total_count) mean?
Also, in this download.py, there is
DEFAULT_LIMIT = "100"  # Set to true if you like seeing console output
What does this DEFAULT_LIMIT mean? Does it mean 100 pages or 100 feeds (posts)?
The params value is just a strangely written GET query string. You can see in the code that it is sent to the server like any other GET request and returns text just like one, too.
In a GET request you have your URL, e.g. youtube.com/watch, and a question mark marks the start of the query string, just like the one at the beginning of the params string. What it is doing is sending a variable called fields to the server.
Additionally it appends the variable limit (separated by an ampersand, &) and sets it to 100, along with finally sending the "since" variable.
As for the field syntax: in the Graph API, likes.limit(0).summary(total_count) asks for the likes connection with no actual like entries returned (limit(0)) but with a summary object carrying the total count (summary(total_count)); the same applies to comments.
So, yes: DEFAULT_LIMIT specifies how many entries (posts) you want in each response to your GET request.
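As a hedged sketch of the kind of request download.py ends up making (GROUP_ID and ACCESS_TOKEN are placeholders, not values from the script), the same query string can be expressed as ordinary params:

import requests

GROUP_ID = "your-group-id"    # placeholder
ACCESS_TOKEN = "your-token"   # placeholder

fields = ("permalink_url,from,story,type,message,link,created_time,"
          "updated_time,likes.limit(0).summary(total_count),"
          "comments.limit(0).summary(total_count)")

# fields, limit, and access_token all travel as ordinary
# query-string variables on the GET request
response = requests.get(
    "https://graph.facebook.com/" + GROUP_ID + "/feed",
    params={"fields": fields, "limit": "100", "access_token": ACCESS_TOKEN},
)
print(response.json())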

Receiving 500 HTTP response when posting to website

I am attempting to extract some information from a website that requires a POST to an ajax script.
I am trying to create an automated script, however I am consistently running into an HTTP 500 error. This is in contrast to a different data pull I did from a
import requests

url = 'http://www.ise.com/ExchangeDataService.asmx/Get_ISE_Dividend_Volume_Data/'

paramList = ''
paramList += '"' + 'dtStartDate' + '":07/25/2014"'
paramList += ','
paramList += '"' + 'dtEndDate' + '":07/25/2014"'
paramList = '{' + paramList + '}'

response = requests.post(url, headers={
    'Content-Type': 'application/json; charset=UTF-8',
    'data': paramList,
    'dataType': 'json'
})
I was wondering if anyone had any recommendations as to what is happening. This isn't proprietary data, as they allow you to manually download it in Excel format.
The input you're generating is not valid JSON. It looks like this:
{"dtStartDate":07/25/2014","dtEndDate":07/25/2014"}
If you look carefully, you'll notice a missing " before the first 07.
This is one of many reasons you shouldn't be trying to generate JSON by string concatenation. Either build a dict and use json.dumps, or, if you must, use a multi-line string as a template for str.format or %.
Also, as bruno desthuilliers points out, you almost certainly want to be sending the JSON as the POST body, not as a data header in an empty POST. Doing it the wrong way does happen to work with some back-ends, but only by accident, and that's certainly not something you should be relying on. And if the server you're talking to isn't one of those back-ends, then you're sending the empty string as your JSON data, which is just as invalid.
So, why does this give you a 500 error? Probably because the backend is some messy PHP code that doesn't have an error handler for invalid JSON, so it just bails with no information on what went wrong, so the server can't do anything better than send you a generic 500 error.
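A minimal sketch of the dict-plus-json.dumps approach this answer recommends (assuming the endpoint accepts these two date fields as a JSON body):

import json
import requests

url = 'http://www.ise.com/ExchangeDataService.asmx/Get_ISE_Dividend_Volume_Data/'

# Build a dict and let json.dumps produce valid JSON,
# then send it as the POST body rather than as a header.
payload = {'dtStartDate': '07/25/2014', 'dtEndDate': '07/25/2014'}
response = requests.post(
    url,
    data=json.dumps(payload),
    headers={'Content-Type': 'application/json; charset=UTF-8'},
)
print(response.status_code)

On requests 2.4.2 or newer, requests.post(url, json=payload) does the serialization and sets the Content-Type header for you.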
If that's a copy/paste from your actual code, 'data' is probably not supposed to be part of the request headers. As a side note: you don't "post to an ajax script", you post to a URL. The fact that this URL is called via an asynchronous request from some javascript on some page of the site is totally irrelevant.
It sounds like a server error, so what you're posting could be breaking their API due to its formatting.
Or their API could be down.
http://pcsupport.about.com/od/findbyerrormessage/a/500servererror.htm
