In scrapyrt's POST documentation we can pass a JSON request like this, but how do you access the meta data like category and item in start_requests?
{
    "request": {
        "meta": {
            "category": "some category",
            "item": {
                "discovery_item_id": "999"
            }
        },
        "start_requests": true
    },
    "spider_name": "target.com_products"
}
Reference: https://scrapyrt.readthedocs.io/en/latest/api.html#id1
There is an unmerged PR in ScrapyRT that adds support for passing extra parameters in the POST request.
1) Patch the resources.py file located in the scrapyrt folder.
In my case it was /usr/local/lib/python3.5/dist-packages/scrapyrt/resources.py
Replace its contents with this code: https://github.com/gdelfresno/scrapyrt/commit/ee3be051ea647358a6bb297632d1ea277a6c02f8
2) Your spider can now access the new parameters as attributes, e.g. self.param1
ScrapyRT curl example:
curl -XPOST -d '{
"spider_name":"quotes",
"start_requests": true,
"param1":"ok"}' "http://localhost:9080/crawl.json"
In your spider:
def parse(self, response):
    print(self.param1)
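As a rough Python counterpart to the curl call above, the sketch below builds the same POST request with only the standard library. It assumes a ScrapyRT server on localhost:9080 as in the curl example; build_crawl_request is a hypothetical helper name, and param1 only reaches the spider if the patch from step 1 has been applied.

```python
import json
import urllib.request


def build_crawl_request(spider_name, base_url="http://localhost:9080", **extra):
    """Build (but do not send) a POST to ScrapyRT's crawl.json endpoint.

    `extra` holds additional top-level parameters (e.g. param1) that the
    patched resources.py is assumed to forward to the spider.
    """
    payload = {"spider_name": spider_name, "start_requests": True}
    payload.update(extra)
    return urllib.request.Request(
        base_url + "/crawl.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_crawl_request("quotes", param1="ok")
# Sending is left commented out because it needs a running ScrapyRT server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Building the body with json.dumps avoids hand-quoting the JSON inside a shell string.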
I want to perform a simple GET request against a public API using the Python requests package.
I am using requests 2.25.1 and Python 3.6.
Unfortunately, there is an extra & prepended to the URL parameters, and I cannot figure out where it comes from. Example code is below, followed by the wrong URL and the correct URL with the ampersand removed.
import requests
import json
URL = "https://search.rcsb.org/rcsbsearch/v1/query?json="
JSON = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {"value": "thymidine kinase"}
    },
    "return_type": "entry"
}
r = requests.get(url=URL, params=json.dumps(JSON, separators=(',', ':')))
r.url then is
https://search.rcsb.org/rcsbsearch/v1/query?json=&%7B%22query%22:%7B%22type%22:%22terminal%22,%22service%22:%22text%22,%22parameters%22:%7B%22value%22:%22thymidine%20kinase%22%7D%7D,%22return_type%22:%22entry%22%7D
which produces a 500 error.
If one changes json=&%7 to json=%7, the request works. How can I get rid of the extra ampersand?
https://search.rcsb.org/rcsbsearch/v1/query?json=%7B%22query%22:%7B%22type%22:%22terminal%22,%22service%22:%22text%22,%22parameters%22:%7B%22value%22:%22thymidine%20kinase%22%7D%7D,%22return_type%22:%22entry%22%7D
The problem is that you're both using the params keyword argument to requests.get and building the parameter string yourself. Because of the ? in your URL, the underlying URL manipulation code assumes that there are already parameters and appends the new ones using & (as in someurl?param1=foo&param2=bar...).
Pick one mechanism or the other, e.g.:
import json
import requests
URL = "https://search.rcsb.org/rcsbsearch/v1/query"
JSON = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {"value": "thymidine kinase"}
    },
    "return_type": "entry"
}
r = requests.get(url=URL, params={'json': json.dumps(JSON, separators=(',', ':'))})
print(r)
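To see why passing a dict avoids the stray ampersand, the following sketch reproduces the query-string encoding with urllib.parse only (no network call). It illustrates the idea rather than the exact code path inside requests, and the variable names are made up for the example.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://search.rcsb.org/rcsbsearch/v1/query"
QUERY = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {"value": "thymidine kinase"},
    },
    "return_type": "entry",
}

# One key ("json") whose value is the compact JSON document.
params = {"json": json.dumps(QUERY, separators=(",", ":"))}
full_url = BASE_URL + "?" + urlencode(params)

# Decoding the query string gives back exactly one "json" parameter.
decoded = parse_qs(urlsplit(full_url).query)
print(full_url)
print(list(decoded))
```

Because the base URL carries no query string of its own, the encoder emits json=... directly after the ? with nothing in between.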
I am learning Flask-RESTful. While following some tutorials, I came across this example:
class Student(Resource):
    def get(self):
        return {student data}

    def post(self, details):
        return {data stored}

api.add_resource(Student, '/student')
Looking at the example above, we can use /student with the GET and POST methods to retrieve and store data.
But I would like to have two different endpoints, one for retrieving and one for storing data.
For example,
/student/get
which will call the get() function of class Student to retrieve the records of all students, and
/student/post
which will call the post() function of class Student to store the posted data.
Is it possible to have a single Student class but call different methods through different endpoints?
Yes, it is possible to have a single Resource class but call different methods through different endpoints.
Scenario:
We have a Student class with get and post methods.
We can use different routes to expose the get and post methods separately or together.
E.g.:
Endpoint http://localhost:5000/students/get can only be used for GET requests to the Student class.
Endpoint http://localhost:5000/students/post can only be used for POST requests to the Student class.
Endpoint http://localhost:5000/students/ can be used for both GET and POST requests to the Student class.
Solution:
To control which requests each endpoint accepts, we need to pass some keyword arguments to the resource class.
We will use the resource_class_kwargs option of the add_resource method. Details of add_resource can be found in the official documentation.
We will block any unwanted method call using the abort method, returning HTTP status 405 with the response message "Method not allowed".
Code:
from flask import Flask
from flask_restful import Resource, Api, abort, reqparse

app = Flask(__name__)
api = Api(app)

parser = reqparse.RequestParser()
parser.add_argument('id', type=int, help='ID of the student')
parser.add_argument('name', type=str, help='Name of the student')


def abort_if_method_not_allowed():
    abort(405, message="Method not allowed")


students = [{"id": 1, "name": "Shovon"},
            {"id": 2, "name": "arsho"}]


class Student(Resource):
    def __init__(self, **kwargs):
        self.get_request_allowed = kwargs.get("get_request_allowed", False)
        self.post_request_allowed = kwargs.get("post_request_allowed", False)

    def get(self):
        if not self.get_request_allowed:
            abort_if_method_not_allowed()
        return students

    def post(self):
        if not self.post_request_allowed:
            abort_if_method_not_allowed()
        student_arguments = parser.parse_args()
        student = {'id': student_arguments['id'],
                   'name': student_arguments['name']}
        students.append(student)
        return student, 201


api.add_resource(Student, '/students', endpoint="student",
                 resource_class_kwargs={"get_request_allowed": True, "post_request_allowed": True})
api.add_resource(Student, '/students/get', endpoint="student_get",
                 resource_class_kwargs={"get_request_allowed": True})
api.add_resource(Student, '/students/post', endpoint="student_post",
                 resource_class_kwargs={"post_request_allowed": True})
Expected behaviors:
curl http://localhost:5000/students/get should call the get method of Student class.
curl http://localhost:5000/students/post -d "id=3" -d "name=Sho" -X POST -v should call the post method of Student class.
curl http://localhost:5000/students can call both of the methods of Student class.
Testing:
We will call the endpoints listed above and check that each one behaves as expected.
Output of get request in students/get using curl http://localhost:5000/students/get:
[
{
"id": 1,
"name": "Shovon"
},
{
"id": 2,
"name": "arsho"
}
]
Output of post request in students/post using curl http://localhost:5000/students/post -d "id=3" -d "name=Shody" -X POST -v:
{
"id": 3,
"name": "Shody"
}
Output of get request in students using curl http://localhost:5000/students:
[
{
"id": 1,
"name": "Shovon"
},
{
"id": 2,
"name": "arsho"
},
{
"id": 3,
"name": "Shody"
}
]
Output of post request in students using curl http://localhost:5000/students -d "id=4" -d "name=Ahmedur" -X POST -v:
{
"id": 4,
"name": "Ahmedur"
}
Output of post request in students/get using curl http://localhost:5000/students/get -d "id=5" -d "name=Rahman" -X POST -v:
{
"message": "Method not allowed"
}
Output of get request in students/post using curl http://localhost:5000/students/post:
{
"message": "Method not allowed"
}
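The gating idea itself does not depend on Flask. The sketch below is a framework-free illustration of it: one class registered under several paths, each with its own constructor keyword arguments. ROUTES, add_resource, dispatch and MethodNotAllowed are hypothetical stand-ins written for this example, not Flask-RESTful APIs, and the per-request instantiation is a simplification of what the framework does.

```python
class MethodNotAllowed(Exception):
    """Stand-in for an HTTP 405 response."""


class Student:
    def __init__(self, **kwargs):
        # Both flags default to False, so a method must be enabled explicitly.
        self.get_request_allowed = kwargs.get("get_request_allowed", False)
        self.post_request_allowed = kwargs.get("post_request_allowed", False)

    def get(self):
        if not self.get_request_allowed:
            raise MethodNotAllowed("Method not allowed")
        return "get called"

    def post(self):
        if not self.post_request_allowed:
            raise MethodNotAllowed("Method not allowed")
        return "post called"


# Hypothetical route table: path -> (resource class, constructor kwargs).
ROUTES = {}


def add_resource(cls, path, resource_class_kwargs=None):
    ROUTES[path] = (cls, resource_class_kwargs or {})


def dispatch(method, path):
    cls, kwargs = ROUTES[path]
    resource = cls(**kwargs)  # build an instance configured for this path
    return getattr(resource, method.lower())()


add_resource(Student, "/students",
             {"get_request_allowed": True, "post_request_allowed": True})
add_resource(Student, "/students/get", {"get_request_allowed": True})
add_resource(Student, "/students/post", {"post_request_allowed": True})

print(dispatch("GET", "/students/get"))
```

Defaulting the flags to False means a new route exposes nothing until a method is switched on for it.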
References:
Official documentation of add_resource method
I'm building a shell application that allows my teammates to start new projects by running a few commands. It should be able to create a new project and a new repository inside that project.
Although I'm specifying the project key/uuid when creating a new repository, it doesn't work. What I'm expecting is a success message with the details for the new repository. Most of the time, this is what I get:
{"type": "error", "error": {"message": "string indices must be integers", "id": "ef4c2b1b49c74c7fbd557679a5dd0e58"}}
or the repository goes to the first project created for that team (which is the default behaviour when no project key/uuid is specified, according to Bitbucket's API documentation).
So I'm guessing there's something in between my request & their code receiving it? Because it looks like they're not even getting the request data.
# Setup Request Body
rb = {
    "scm": "git",
    "project": {
        "key": "PROJECT_KEY_OR_UUID"
    }
}
# Setup URL
url = "https://api.bitbucket.org/2.0/repositories/TEAM_NAME/REPOSITORY_NAME"
# Request
r = requests.post(url, data=rb)
In the code from the api docs you'll notice that the Content-Type header is "application/json".
$ curl -X POST -H "Content-Type: application/json" -d '{
"scm": "git",
"project": {
"key": "MARS"
}
}' https://api.bitbucket.org/2.0/repositories/teamsinspace/hablanding
In your code you're passing your data in the data parameter, which creates an "application/x-www-form-urlencoded" Content-Type header, and urlencodes your post data.
Instead, you should use the json parameter.
rb = {
    "scm": "git",
    "project": {
        "key": "PROJECT_KEY_OR_UUID"
    }
}
url = "https://api.bitbucket.org/2.0/repositories/TEAM_NAME/REPOSITORY_NAME"
r = requests.post(url, json=rb)
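The difference between the two encodings can be seen without sending anything. The sketch below uses urllib.parse.urlencode with doseq=True as a stand-in for form encoding; that requests flattens the nested dict in the same way is an assumption here, not something stated above.

```python
import json
from urllib.parse import urlencode

rb = {"scm": "git", "project": {"key": "PROJECT_KEY_OR_UUID"}}

# Form encoding: the nested dict is treated as a sequence of its keys,
# so the project key value itself is lost.
form_body = urlencode(rb, doseq=True)

# JSON encoding: the nested structure survives intact.
json_body = json.dumps(rb)

print(form_body)  # scm=git&project=key
print(json_body)
```

If the server receives project as the plain string "key" instead of an object, an error such as "string indices must be integers" would be a plausible result.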
MongoDB
db.entity.find()
{
"_id" : ObjectId("5563a4c5567b3104c9ad2951"),
"section" : "section1",
"chapter" : "chapter1",
...
},
{
"_id" : ObjectId("5563a4c5567b3104c9ad2951"),
"section" : "section1",
"chapter" : "chapter2",
...
},....
In my database, the entity collection mainly contains section and chapter fields. Only chapter values are unique, so querying Mongo for a given section returns several results (one section matches many chapters).
What I need is to get all documents of a given section; it's as simple as that.
settings.py
URL_PREFIX = "api"
API_VERSION = "v1"
ALLOWED_FILTERS = ['*']

schema = {
    'section': {
        'type': 'string',
    },
    'chapter': {
        'type': 'string',
    },
}

entity = {
    'item_title': 'entity',
    'resource_methods': ['GET'],
    'cache_control': '',
    'cache_expires': 0,
    'url': 'entity/<regex("\w+"):section>/section',
    'schema': schema
}

DOMAIN = {
    'entity': entity,
}
run.py
from eve import Eve

# Create the application; Eve reads settings.py from the working directory.
app = Eve()

if __name__ == '__main__':
    app.run(debug=True)
What I tried
curl -L -X GET -H 'Content-Type: application/json' http://127.0.0.1:5000/api/v1/entity/23/section
OUTPUT
{
"_status": "ERR",
"_error": {
"message": "The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.",
"code": 404
}
}
What am I doing wrong? How can I get all entities of one section?
Did you try writing the URL like this:
http://127.0.0.1:5000/api/v1/entity?where=section=='section32'
Please have a look at python-eve.org/features.html#filtering
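Filter expressions contain characters that need URL-encoding, so it can help to build the URL programmatically. The sketch below uses the Mongo-style JSON form of the where parameter and only the standard library. build_filter_url is a hypothetical helper, and it assumes the resource is exposed at /api/v1/entity, which is not the case with the url setting shown in the question.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs


def build_filter_url(base_url, **conditions):
    """Return base_url with a URL-encoded `where` parameter.

    The filter is serialized as a Mongo-style JSON document,
    e.g. {"section": "section32"}.
    """
    return base_url + "?" + urlencode({"where": json.dumps(conditions)})


url = build_filter_url("http://127.0.0.1:5000/api/v1/entity", section="section32")

# Round-trip check: decode the query string back into the filter document.
where = json.loads(parse_qs(urlsplit(url).query)["where"][0])
print(url)
print(where)
```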
Try with:
curl -H 'Accept: application/json' http://127.0.0.1:5000/api/v1/entity/section23/section
You want to use section23 in your url since that's what matches both the regex and the actual value stored on the database.
You might also want to change the url to:
'url': 'entity/<regex("\w+"):section>'
Which would allow for urls like this:
http://127.0.0.1:5000/api/v1/entity/section23
Which is probably more semantic. Hope this helps.
I am planning to build a plug-in for the Sphinx documentation system that shows the names and GitHub profile links of the people who have contributed to a documentation page.
GitHub has this feature internally.
Is it possible to get the GitHub profile links of a file's contributors through the GitHub API? Note that committer emails are not enough; one must be able to map them to a GitHub user profile link. Also note that I don't want all repository contributors, just the contributors to an individual file.
If this is not possible, what alternative methods (private API, scraping) could you suggest for extracting this information from GitHub?
First, you can show the commits for a given file:
https://api.github.com/repos/:owner/:repo/commits?path=PATH_TO_FILE
For instance:
https://api.github.com/repos/git/git/commits?path=README
Second, that JSON response does, in the author section, contain a URL field named 'html_url' that points to the GitHub profile:
"author": {
"login": "gitster",
"id": 54884,
"avatar_url": "https://0.gravatar.com/avatar/750680c9dcc7d0be3ca83464a0da49d8?d=https%3A%2F%2Fidenticons.github.com%2Ff8e73a1fe6b3a5565851969c2cb234a7.png",
"gravatar_id": "750680c9dcc7d0be3ca83464a0da49d8",
"url": "https://api.github.com/users/gitster",
"html_url": "https://github.com/gitster", <==========
"followers_url": "https://api.github.com/users/gitster/followers",
"following_url": "https://api.github.com/users/gitster/following{/other_user}",
"gists_url": "https://api.github.com/users/gitster/gists{/gist_id}",
"starred_url": "https://api.github.com/users/gitster/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/gitster/subscriptions",
"organizations_url": "https://api.github.com/users/gitster/orgs",
"repos_url": "https://api.github.com/users/gitster/repos",
"events_url": "https://api.github.com/users/gitster/events{/privacy}",
"received_events_url": "https://api.github.com/users/gitster/received_events",
"type": "User"
},
So you shouldn't need to scrape any web page here.
Here is a very crude jsfiddle to illustrate that, based on this JavaScript extract:
var url = "https://api.github.com/repos/git/git/commits?path=" + filename;
$.getJSON(url, function(data) {
    var twitterList = $("<ul />");
    $.each(data, function(index, item) {
        // author is null when the commit is not linked to a GitHub account
        if (item.author) {
            $("<li />", {
                "text": item.author.html_url
            }).appendTo(twitterList);
        }
    });
});
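The same extraction can be sketched in Python. The function below takes the already-decoded JSON list from the commits endpoint and returns the unique profile links; contributor_profiles is a hypothetical name, and the sample data is a made-up, heavily trimmed imitation of the response shape shown above.

```python
def contributor_profiles(commits):
    """Return unique GitHub profile URLs, in first-seen order.

    `commits` is the decoded JSON list from the commits endpoint. Entries
    whose top-level "author" is null (no linked GitHub account) are skipped.
    """
    seen = []
    for commit in commits:
        author = commit.get("author")
        if author and author["html_url"] not in seen:
            seen.append(author["html_url"])
    return seen


# Made-up sample shaped like the API response, trimmed to the fields used.
sample = [
    {"author": {"login": "gitster", "html_url": "https://github.com/gitster"}},
    {"author": None},
    {"author": {"login": "gitster", "html_url": "https://github.com/gitster"}},
]
print(contributor_profiles(sample))  # ['https://github.com/gitster']
```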
Using the GraphQL API v4, you can use:
{
  repository(owner: "torvalds", name: "linux") {
    object(expression: "master") {
      ... on Commit {
        history(first: 100, path: "MAINTAINERS") {
          nodes {
            author {
              email
              name
              user {
                email
                name
                avatarUrl
                login
                url
              }
            }
          }
        }
      }
    }
  }
}
Using curl and jq to get a deduplicated list of the authors of the first 100 commits that touch this file:
TOKEN=<YOUR_TOKEN>
OWNER=torvalds
REPO=linux
BRANCH=master
FILEPATH=MAINTAINERS
curl -s -H "Authorization: token $TOKEN" \
-H "Content-Type:application/json" \
-d '{
"query": "{repository(owner: \"'"$OWNER"'\", name: \"'"$REPO"'\") {object(expression: \"'"$BRANCH"'\") { ... on Commit { history(first: 100, path: \"'"$FILEPATH"'\") { nodes { author { email name user { email name avatarUrl login url}}}}}}}}"
}' https://api.github.com/graphql | \
jq '[.data.repository.object.history.nodes[].author| {name,email}]|unique'
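The shell quoting in the curl call is easy to get wrong, so here is a sketch of the same request body built in Python, using GraphQL variables instead of string splicing. It only constructs the payload and post-processes a response; build_payload and unique_authors are hypothetical helpers, the trimmed field selection is a simplification, and the sample response is invented for illustration.

```python
import json

QUERY = """
query($owner: String!, $repo: String!, $branch: String!, $path: String!) {
  repository(owner: $owner, name: $repo) {
    object(expression: $branch) {
      ... on Commit {
        history(first: 100, path: $path) {
          nodes { author { email name user { login url } } }
        }
      }
    }
  }
}
"""


def build_payload(owner, repo, branch, path):
    """Return the JSON body to POST to https://api.github.com/graphql."""
    variables = {"owner": owner, "repo": repo, "branch": branch, "path": path}
    return json.dumps({"query": QUERY, "variables": variables})


def unique_authors(response):
    """Deduplicate (name, email) pairs, roughly like the jq filter above."""
    nodes = response["data"]["repository"]["object"]["history"]["nodes"]
    pairs = {(n["author"]["name"], n["author"]["email"]) for n in nodes}
    ordered = sorted(pairs, key=lambda p: (p[0] or "", p[1] or ""))
    return [{"name": name, "email": email} for name, email in ordered]


payload = build_payload("torvalds", "linux", "master", "MAINTAINERS")

# Invented response with a duplicate author, for illustration only.
fake = {"data": {"repository": {"object": {"history": {"nodes": [
    {"author": {"name": "A", "email": "a@example.com", "user": None}},
    {"author": {"name": "A", "email": "a@example.com", "user": None}},
    {"author": {"name": "B", "email": "b@example.com", "user": None}},
]}}}}}
print(unique_authors(fake))
```

The payload would be sent with an Authorization header carrying the token, as in the curl example.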
Why do you need to use the GitHub API for that? You can just clone the repository and use git log:
git log --format=format:%an path/to/file ver1..ver2 |sort |uniq
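Unless you need to interact with the GitHub API directly, you can get the list of contributors by cloning the repository, changing into the cloned directory, and running git shortlog:
import os
import subprocess

# Working directory to clone into (adjust to your machine); a raw string
# keeps the backslashes in the Windows path literal.
os.chdir(r"C:\Users\DhruvOhri\Documents\COMP 6411\pygithub3-0.3")
subprocess.run(["git", "clone", "https://github.com/poise/python.git"], check=True)
# The clone creates a "python" directory; enter it before running git.
os.chdir("python")
# -s prints only commit counts, -n sorts by number of commits; HEAD is named
# explicitly so git does not wait for input on stdin.
output = subprocess.check_output(["git", "shortlog", "-s", "-n", "HEAD"], text=True)
print(output)
input("press enter to continue")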
If you do want to use the GitHub API, another option is the pygithub3 wrapper, whose list_contributors method returns the contributors of a repository:
from pygithub3.services.repos import Repo

r = Repo()
# Keep the paginated result and iterate over it page by page.
contributors = r.list_contributors(user='userid/author', repo='repo name')
for page in contributors:
    for result in page:
        print(result)