HtmlResponse working in Scrapy Shell, but not in script? - python

I'm using scraperAPI.com to handle IP rotation for a scraping job I'm working on, and I'm trying to implement their new POST request method, but I keep receiving a 'HtmlResponse' object has no attribute 'dont_filter' error. Here is the custom start_requests function:
def start_requests(self):
    S_API_KEY = {'key': 'eifgvaiejfvbailefvbaiefvbialefgilabfva5465461654685312165465134654311'}
    url = "XXXXXXXXXXXXXX.com"
    payload = {}
    headers = {
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'Access-Control-Allow-Origin': '*',
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'referer': 'XXXXXXXXXXX.com'
    }
    client = ScraperAPIClient(S_API_KEY['key'])
    resp = client.post(url=url, body=payload, headers=headers)
    yield HtmlResponse(resp.url, body=resp.text, encoding='utf-8')
The weird part is that when I execute this script piecewise in the scrapy shell, it works fine and returns the proper data. Any insight into this issue would be GREATLY appreciated; I'm currently 4 hours into this problem.
Notes:
client.post returns a response object
Not my real API key
client.post doesn't have a body method

The error you get is caused by returning the wrong type (a Response).
From the docs for start_requests:
This method must return an iterable with the first Requests to crawl for this spider.
It seems the easiest solution would be using a Scrapy request (probably a FormRequest) to the API url, instead of using ScraperAPIClient.post(); a sketch follows below.
You should be able to use ScraperAPIClient.scrapyGet() to generate the correct url, but I have not tested this.
If you would prefer to continue using the official api library, a slightly more complicated option is writing your own downloader middleware.
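For illustration, here is a minimal, untested sketch of the FormRequest route; the spider class, the callback, and the assumption that the endpoint accepts a form-encoded POST are mine, not from the question:
import scrapy

class ApiSpider(scrapy.Spider):
    name = 'api_spider'  # hypothetical name

    def start_requests(self):
        # Yield a Request, not a Response: Scrapy's downloader fetches it
        # and passes the resulting Response to the callback.
        yield scrapy.FormRequest(
            url='https://XXXXXXXXXXXXXX.com',  # endpoint from the question
            formdata={},                       # the (empty) payload
            headers={
                'x-requested-with': 'XMLHttpRequest',
                'accept': 'application/json, text/javascript, */*; q=0.01',
            },
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('Got %d bytes from %s', len(response.body), response.url)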

Related

How to send If-None-Match header with Scrapy or requests in Python?

I am scraping a site with Scrapy, but some of its APIs do not return JSON data without the 'if-none-match' header.
I have a list of more than 100 APIs, so I want to add this header automatically in order to get valid JSON. Does anybody know how to handle this, or is there another way to get around it?
Thanks in advance.
You can use the DEFAULT_REQUEST_HEADERS setting if you want to define headers for all requests:
# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'If-None-Match': '*',
}
or the headers parameter for individual requests:
req = scrapy.Request(url, callback=self.parse, headers={'If-None-Match': '*'})

Python client for Compute Engine returns "Required field 'resource' not specified"

I'm trying to create a VM using the python client. The call I'm making is
import googleapiclient.discovery
compute = googleapiclient.discovery.build('compute', 'v1')
compute.instances().insert(
    project='my-project',
    zone='us-central1-c',
    body=config).execute()
(config is a JSON string, available here)
and the response is
<HttpError 400 when requesting https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-c/instances?alt=json
returned "Required field 'resource' not specified">
From this forum post and this stackexchange question, it appears the problem is with the REST API headers. However, headers aren't exposed by the python client, as far as I know.
Is this a bug or is there something else I might be doing incorrectly?
EDIT
Following the error back to googleapiclient.http.HttpRequest, it looks like the HttpRequest object generated by build() has headers
{ 'accept': 'application/json',
'accept-encoding': 'gzip, deflate',
'content-length': '2299',
'content-type': 'application/json',
'user-agent': 'google-api-python-client/1.7.7 (gzip)' }
I tried adding 'resource': 'none' to the headers and received the same response.
After looking at this for a while, I suspect the REST API is expecting a Compute Engine resource to be specified. However, searching for the word "resource" on the official docs yields 546 results.
EDIT2
Created GitHub Issue.
Use the request body ("requestBody") instead of "resource".
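One likely culprit, guessing from the note that config is a JSON string: the discovery client JSON-serializes the body itself, so a pre-serialized string gets encoded a second time and the API sees a quoted string instead of a resource object. A minimal sketch of that fix (the config.json filename is hypothetical):
import json
import googleapiclient.discovery

compute = googleapiclient.discovery.build('compute', 'v1')

# Parse the JSON string into a dict; a raw string passed as body=
# would be serialized again and arrive as a string, not an object.
with open('config.json') as f:
    config = json.load(f)

operation = compute.instances().insert(
    project='my-project',
    zone='us-central1-c',
    body=config).execute()
print(operation.get('status'))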

HTTP headers - Requests - Python

I am trying to scrape a website where the request headers have some attributes that are new to me, such as :authority, :method, :path, :scheme.
{':authority': 'xxxx', ':method': 'GET', ':path': '/xxxx', ':scheme': 'https', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 'accept-encoding': 'gzip, deflate, br', 'accept-language': 'en-US,en;q=0.9', 'cache-control': 'max-age=0', 'cookie': 'GOOGLE_ABUSE_EXEMPTION=ID=0d5af55f1ada3f1e:TM=1533116294:C=r:IP=182.71.238.62-:S=APGng0u2o9IqL5wljH2o67S5Hp3hNcYIpw;1P_JAR=2018-8-1-9', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36', 'x-client-data': 'CJG2yQEIpbbJAQjEtskBCKmdygEI2J3KAQioo8oBCIKkygE='}
I tried passing them as headers with the HTTP request but ended up with the error shown below.
ValueError: Invalid header name b':scheme'
Any help in understanding these attributes, and guidance on how to handle them when sending requests, would be appreciated.
EDIT:
code added
import requests
url = 'https://www.google.co.in/search?q=some+text'
headers = {
    ':authority': 'xxxx',
    ':method': 'GET',
    ':path': '/xxxx',
    ':scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    'x-client-data': 'CJG2yQEIpbbJAQjEtskBCKmdygEI2J3KAQioo8oBCIKkygE=',
}
response = requests.get(url, headers=headers)
print(response.text)
Your error comes from here (python's source code).
HTTP header names cannot start with a colon, as the RFC states.
:authority, :method, :path, :scheme are HTTP/2 pseudo-headers that Chrome's dev tools display, not ordinary HTTP headers:
https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
':method':'GET'
defines the HTTP request method:
https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods
and
:authority, :path, :scheme
are parts of the URI: https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Generic_syntax
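Putting that together, a minimal sketch, assuming the goal is simply to reproduce the browser request: drop the ':'-prefixed entries and let requests derive those values from the URL and the method call:
import requests

url = 'https://www.google.co.in/search?q=some+text'
# Only real headers remain; requests.get() supplies :method, and the
# URL supplies :authority, :path and :scheme.
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
}
response = requests.get(url, headers=headers)
print(response.status_code)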

Error 500 sending python request as chrome

I am dealing with this little error but I can not find the solution. I authenticated on a page with Chrome's "inspect/network" tool open to see what web service is called and how. I found out this request is used (screenshot omitted; I have censored sensitive data related to the site):
So, I have to do this same request using python, but I am always getting error 500, and the log on the server side is not showing helpful information (only a java traceback).
This is the code of the request
response = requests.post(url,data = 'username=XXXXX&password=XXXXXXX')
URL has the same string that you see in the image under the "General/Request URL" label.
Data has the same string that you see in the image under "Form Data".
It looks like a very simple request, but I can not get it to work :(.
Best regards
If you want your request to appear as if it's coming from Chrome, then besides sending the correct data you need to specify headers as well. The reason you got a 500 error is probably that there are certain settings on the server side disallowing traffic from "non-browsers".
So in your case, you need to add headers:
headers = {'Accept': 'application/json, text/plain, */*',
           'Accept-Encoding': 'gzip, deflate',
           # ...... more headers copied from the browser
           'User-Agent': 'Mozilla/5.0 XXXXX...'  # this line tells the server what browser/agent is used for this request
           }
response = requests.post(url,data = 'username=XXXXX&password=XXXXXXX', headers=headers)
P.S. If you are curious, default headers from requests are:
>>> import requests
>>> session = requests.Session()
>>> session.headers
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate',
'Accept': '*/*', 'User-Agent': 'python-requests/2.13.0'}
As you can see the default User-Agent is python-requests/2.13.0, and some websites do block such traffic.
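If you make several requests, a requests.Session lets you set the browser-like headers once; a small sketch, with a hypothetical login URL standing in for the censored one:
import requests

session = requests.Session()
# Merge into the defaults shown above; every request on this session reuses them.
session.headers.update({
    'User-Agent': 'Mozilla/5.0 XXXXX...',
    'Accept': 'application/json, text/plain, */*',
})
url = 'https://example.com/login'  # hypothetical endpoint
response = session.post(url, data='username=XXXXX&password=XXXXXXX')
print(response.status_code)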

HTTP requests.post fails

I'm using the python requests library to get and post HTTP content. I have no problem using the get function, but my post function seems to fail or not do anything at all. From my understanding of the requests library, the POST function automatically encodes the data you send, but I'm not sure that's actually happening.
code:
data = 'hash=' + hash + '&confirm=Continue+as+Free+User'
r = requests.post(url, data)
html = r.text
By checking the "value" of html, I can tell that the returned response is that of the url without the POST.
You're not taking advantage of how requests will encode it for you. To do so, you need to write your code this way:
data = {'hash': hash, 'confirm': 'Continue as Free User'}
r = requests.post(url, data)
html = r.text
I can not test this for you, but this is how the encoding happens automatically (from the docstring of requests.post):
post(url, data=None, **kwargs)
Sends a POST request. Returns :class:`Response` object.
:param url: URL for the new :class:`Request` object.
:param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
:param \*\*kwargs: Optional arguments that ``request`` takes.
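To confirm the automatic form-encoding without hitting the network, you can prepare the request and inspect its body; the hash value and URL here are stand-ins:
import requests

data = {'hash': 'abc123', 'confirm': 'Continue as Free User'}
req = requests.Request('POST', 'https://example.com/download', data=data)
prepared = req.prepare()
print(prepared.body)                     # hash=abc123&confirm=Continue+as+Free+User
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded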
import requests
url = "http://computer-database.herokuapp.com/computers"
payload = "name=Hello11111122OKOK&introduced=1986-12-26&discontinued=2100-12-26&company=13"
headers = {
    'accept-language': "en-US,en;q=0.9,kn;q=0.8",
    'accept-encoding': "gzip, deflate",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    'content-type': "application/x-www-form-urlencoded",
    'cache-control': "no-cache",
    'postman-token': "3e5dabdc-149a-ff4c-a3db-398a7b52f9d5"
}
response = requests.request("POST", url, data=payload, headers=headers)
print(response.text)
