I have a table (as a Pandas DataFrame) of (mostly) GitHub repos, for which I need to automatically extract the LICENSE link. However, it is a requirement that the link does not simply point to /blob/master/ but to a specific commit, since the master link might be updated at some point. I assembled a Python script to do this through the GitHub API, but using the API I am only able to retrieve the link with the master tag.
I.e. instead of
https://github.com/jsdom/abab/blob/master/LICENSE.md
I want
https://github.com/jsdom/abab/blob/8abc2aa5b1378e59d61dee1face7341a155d5805/LICENSE.md
Any idea if there is a way to automatically get the link to the latest commit for a file, in this case the LICENSE file?
This is the code I have written so far:
def githubcrawl(repo_url, session, headers):
    parts = repo_url.split("/")[3:]
    url_tmpl = "http://api.github.com/repos/{}/license"
    url = url_tmpl.format("/".join(parts))
    try:
        response = session.get(url, headers=headers)
        if response.status_code in [404]:
            return f"404: {repo_url}"
        else:
            data = json.loads(response.text)
            return data["html_url"]  # Returns the html URL to the LICENSE file
    except urllib.error.HTTPError as e:
        print(repo_url, "-", e)
        return f"http_error: {repo_url}"
token="mytoken" # Token for github authentication to get more requests per hour
headers={"Authorization": "token %s" % token}
session = requests.Session()
lizlinks = [] # List to store the links of the LICENSE files in
# iterate over DataFrame of applications/deps
for idx, row in df.iterrows():
# if idx < 5:
if type(row["Homepage"]) == type("str"):
repo_url = re.sub(r"\#readme", "", row["Homepage"])
response = session.get(repo_url, headers=headers)
repo_url = response.url # Some URLs are just redirects, so I get the actual repo url here
if "github" in repo_url and len(repo_url.split("/")) >= 3:
link = githubcrawl(repo_url, session, headers)
print(link)
lizlinks.append(link)
else:
print(row["Homepage"], "Not a github Repo")
lizlinks.append("Not a github repo")
else:
print(row["Homepage"], "Not a github Repo")
lizlinks.append("Not a github repo")
Bonus question: would parallelizing this task work with the GitHub API? I.e. could I send multiple requests at once without being locked out (DoS), or is the for loop a good approach to avoid this? It takes quite a while to go through the ~1000 repos I have in that list.
Ok, I found a way to get the unique SHA hash of the current commit. I believe that should always link to the license file at that point in time.
Using the Python git library, I simply run the ls_remote git command and return the HEAD SHA:
def lsremote_HEAD(url):
    g = git.cmd.Git()
    HEAD_sha = g.ls_remote(url).split()[0]
    return HEAD_sha
I can then replace the "master", "main" or whatever tag in my githubcrawl function:
token="token_string"
headers={"Authorization": "token %s" % token}
session = requests.Session()
def githubcrawl(repo_url, session, headers):
parts = repo_url.split("/")[3:]
api_url_tmpl = "http://api.github.com/repos/{}/license"
api_url = api_url_tmpl.format("/".join(parts))
try:
print(api_url)
response = session.get(api_url, headers=headers)
if response.status_code in [404]:
return(f"404: {repo_url}")
else:
data = json.loads(response.text)
commit_link = re.sub(r"/blob/.+?/",rf"/blob/{lsremote_HEAD(repo_url)}/", data["html_url"])
return(commit_link)
except urllib.error.HTTPError as e:
print(repo_url, "-", e)
return f"http_error: {repo_url}"
Maybe this helps someone, so I'm posting this answer here.
This answer uses the following libraries:
import re
import git
import urllib
import json
import requests
I have a list of LinkedIn post IDs. I need to request share statistics for each of those posts with another request.
The request function looks like this:
def ugcp_stats(headers):
    response = requests.get(f'https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&ugcPosts=List(urn%3Ali%3AugcPost%3A{shid},urn%3Ali%3AugcPost%3A{shid2},...,urn%3Ali%3AugcPost%3A{shidx})', headers=headers)
    ugcp_stats = response.json()
    return ugcp_stats
urn%3Ali%3AugcPost%3A{shid},urn%3Ali%3AugcPost%3A{shid2},...,urn%3Ali%3AugcPost%3A{shidx} - these are the share URNs. Their number depends on the number of elements in my list.
What should I do next? Should I count the number of elements in my list and somehow amend the request URL to include all of them? Or maybe I should loop through the list and make a separate request for each of the elements and then append all the responses in one json file?
I'm struggling and I'm not quite sure how to write this. I don't even know how to pass the element into the request. Although I suspect it could look something like this:
for shid in shids:
    def ugcp_stats(headers):
        response = requests.get(f'https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&ugcPosts=List(urn%3Ali%3AugcPost%3A & {shid})', headers=headers)
        ugcp_stats = response.json()
        return ugcp_stats
UPDATE - following your answers
The code looks like this now:
link = "https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&ugcPosts=List"
def share_stats(headers, shids):
# Local variable
sample = ""
# Sample the shids in the right pattern
for shid in shids: sample += "urn%3Ali%3AugcPost%3A & {},".format(shid)
# Get the execution of the string content
response = eval(f"requests.get('{link}({sample[:-1]})', headers = {headers})")
# Return the stats
return response.json()
if __name__ == '__main__':
credentials = 'credentials.json'
access_token = auth(credentials) # Authenticate the API
headers = headers(access_token) # Make the headers to attach to the API call.
share_stats = share_stats(headers) # Get shares
print(share_stats)
But nothing seems to be happening. It finishes the script, but I don't get anything. What's wrong?
This is just a proof of concept for what I told you earlier in a comment. You will need to adapt it to your needs (even though I tried to do that for you) :)
Updated - based on your feedback.
#// IMPORT
#// I'm assuming you are using the "requests" library
#// The PyCharm IDE flags this library as unused, but "eval()" is using it
import requests

#// GLOBAL VARIABLES
link: str = "https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&ugcPosts=List"

#// Your function logic, updated
def share_stats(sheds: list, header: dict) -> any:
    # Local variable
    sample = ""
    # Sample the sheds in the right pattern
    for shed in sheds: sample += "urn%3Ali%3AugcPost%3A & {},".format(shed)
    # Get the execution of the string content
    response = eval(f"requests.get('{link}({sample[:-1]})', headers = {header})")
    # Return the stats as JSON
    return response.json()

#// Run only if this is the main file
if __name__ == '__main__':
    #// An empty sheds list for code validation
    debug_sheds: list = []
    credentials: str = "credentials.json"
    #// For me "auth" is an unresolved reference; for you it should be fine
    #// I'm assuming it is your function for reading the file content and converting it to Python
    access_token = auth(credentials)  # Authenticate the API
    #// Your error came from this line
    #// Error message: 'TypedDict' object is not callable
    #// Your code: headers = headers(access_token)
    #// When you want to get a dictionary value by a key, use square brackets
    headers = headers[access_token]  # Make the headers to attach to the API call.
    #// Here you should get an error/warning because you did not provide the sheds the first time
    #// Your code: share_stats = share_stats(headers)
    share_stats = share_stats(debug_sheds, headers)  # Get shares
    print(share_stats)
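For what it's worth, the same request can also be issued without eval, since the final URL is just a string. A minimal sketch, assuming shids holds the numeric post IDs and headers is already a valid dict (the URN format is taken from the question above):

import requests

link = "https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&ugcPosts=List"

def share_stats_no_eval(shids, headers):
    # Build the List(...) parameter from the URL-encoded post URNs
    urns = ",".join("urn%3Ali%3AugcPost%3A{}".format(shid) for shid in shids)
    # Plain requests call; the URL can be assembled with ordinary string formatting
    response = requests.get("{}({})".format(link, urns), headers=headers)
    return response.json()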
I am trying to grab all the members we have in a GitHub organization. We have about 4K.
Using the documentation here, I am trying to page through the results, but it is not iterating through the pages of results.
Here is the Code:
from dotenv import load_dotenv, find_dotenv
import json
import requests
import os

load_dotenv(find_dotenv())

headers = {
    "authorization": f"{os.getenv('github_token')}",
    "content-type": "application/json"
}

query_url = "https://api.github.com/orgs/<name of Org>/members?page="

members = []
page_no = 1
loop_control = 0

while loop_control == 0:
    url = query_url + str(page_no)
    request = requests.get(url, headers=headers)
    print(url)
    print(request.status_code)
    response = request.json()
    print(len(response))
    for i in response:
        members.append(i)
    if len(response) == 30:
        page_no += 1
    elif len(response) < 30:
        loop_control = 1

with open('data/github/response.json', 'w') as file:
    print(len(members))
    json.dump(members, file)
With this code, it grabs the first 30 results, then it grabs 7 for page 2 of the results.
Any ideas?
Two things to check about your script:
ensure the token is associated with an account that is a member of the organization
ensure your token has the read:org scope set
If either of these conditions is not met, the script will only see users who have public membership for the organization, which would explain the difference in numbers you're seeing.
To also improve the script performance, you can add a per_page=100 query string parameter to get 100 results per API call, instead of the default 30. This is documented in the Pagination section of the API docs.
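For example, a minimal pagination sketch (the org name and token are placeholders; it follows the Link header's "next" relation rather than counting results):

import requests

headers = {"authorization": "<github_token>"}  # placeholder token, same header shape as above
url = "https://api.github.com/orgs/<name of Org>/members?per_page=100"

members = []
while url:
    response = requests.get(url, headers=headers)
    members.extend(response.json())
    # requests exposes the parsed Link header; stop when there is no "next" page
    url = response.links.get("next", {}).get("url")

print(len(members))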
I managed to import queries into another account. I used the endpoint POST function given by Redash, but it sort of applies only to “modifying/replacing” an existing query: https://github.com/getredash/redash/blob/5aa620d1ec7af09c8a1b590fc2a2adf4b6b78faa/redash/handlers/queries.py#L178
So actually, if I want to import a new query what should I do? I want to create a new query that doesn’t exist on my account. I’m looking at https://github.com/getredash/redash/blob/5aa620d1ec7af09c8a1b590fc2a2adf4b6b78faa/redash/handlers/queries.py#L84
Following is the function which I made to create new queries if the query_id doesn’t exist.
url = path, api = user api, f = filename, query_id = query_id of file in local desktop
def new_query(url, api, f, query_id):
    headers = {'Authorization': 'Key {}'.format(api), 'Content-Type': 'application/json'}
    path = "{}/api/queries".format(url)
    query_content = get_query_content(f)
    query_info = {'query': query_content}
    print(json.dumps(query_info))
    response = requests.post(path, headers=headers, data=json.dumps(query_info))
    print(response.status_code)
I am getting response.status_code 500. Is there anything wrong with my code? How should I fix it?
For future reference :-) here's a python POST that creates a new query:
payload = {
    "query": query,  ## the select query
    "name": "new query name",
    "data_source_id": 1,  ## can be determined from the /api/data_sources end point
    "schedule": None,
    "options": {"parameters": []}
}

res = requests.post(redash_url + '/api/queries',
                    headers={'Authorization': 'Key YOUR KEY'},
                    json=payload)
(solution found thanks to an offline discussion with #JohnDenver)
TL;DR:
...
query_info = {'query':query_content,'data_source_id':<find this number>}
...
Verbose:
I had a similar problem. I checked the Redash source code; it looks for data_source_id. I added the data_source_id to my data payload, which worked.
You can find the appropriate data_source_id by looking at the response from a 'get query' call:
import json

def find_data_source_id(url, query_number, api):
    path = "{}/api/queries/{}".format(url, query_number)
    headers = {'Authorization': 'Key {}'.format(api), 'Content-Type': 'application/json'}
    response = requests.get(path, headers=headers)
    return json.loads(response.text)['data_source_id']
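A hypothetical usage sketch, assuming you already have one query in the account whose data source you want to reuse (the base URL, API key, and query number are placeholders):

redash_url = "https://redash.example.com"  # placeholder Redash base URL
api_key = "YOUR KEY"                       # placeholder API key

# Look up the data_source_id from an existing query, then include it in the creation payload
data_source_id = find_data_source_id(redash_url, 42, api_key)
query_info = {'query': 'SELECT 1;', 'data_source_id': data_source_id}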
The official Redash API documentation is quite lacking; it doesn't give any examples for the documented "Common Endpoints", and I had no idea how to use the API key.
Check out https://github.com/damienzeng73/redash-api-client instead.
What I am trying to do is: I want to extract the names of the classes that have been modified in a pull request. To do that, I do the following:
From GitHub API:
1) I extract all the pull requests for one project
2) I extract all the commits for each pull request
3) I keep only the first commit and last commit for each pull request.
Since at this point, I don't know how to extract the list of modified classes between these two commits per pull request, I use 'git' package, like this:
I cloned the gson repository into D:\\projects\\gson
import git

repo = git.Repo("D:\\projects\\gson")
commits_list = list(repo.iter_commits())

temp = []
for x in commits_list[0].diff(commits_list[-1]):
    if x.a_path == x.b_path:
        if x.a_path.endswith('.java'):
            temp.append(x.a_path)
    else:
        if x.b_path.endswith('.java'):
            temp.append(x.b_path)
Here is how I extract the commits from the GitHub API:
projects = [{'owner': 'google', 'repo': 'gson', 'pull_requests': []}]

def get(url):
    global nb
    PARAMS = {
        'client_id': '----my_client_id---',
        'client_secret': '---my_client_secret---',
        'per_page': 100,
        'state': 'all'  # open, closed, all
    }
    result = requests.get(url=url, params=PARAMS)
    nb += 1
    if not (result.status_code in [200, 304]):
        raise Exception('request error', url, result, result.headers)
    data = result.json()
    while 'next' in result.links.keys():
        result = requests.get(url=result.links['next']['url'], params=PARAMS)
        data.extend(result.json())
        nb += 1
    return data

def get_pull_requests(repo):
    url = 'https://api.github.com/repos/{}/pulls'.format(repo)
    result = get(url)
    return result

def get_commits(url):
    result = get(url)
    return result

for i, project in enumerate(projects):
    project['pull_requests'] = get_pull_requests('{}/{}'.format(project['owner'], project['repo']))
    for p in project['pull_requests']:
        p['commits'] = get_commits(p['commits_url'])
    print('{}/{}'.format(project['owner'], project['repo']), ':', len(project['pull_requests']))
Each of these two pieces of code works. The problem is, I get 287 commits from the GitHub API, but only 86 commits from git.Repo for the same project. When I try to match these commits, fewer than 40 of them match.
Questions:
1) Why am I getting different commits for the same project?
2) Which one is correct and I should use?
3) Is there a way I can know what commits are for what pull request using Git.Repo ?
4) Is there a way I can extract the modified classes between two commits in GitHub API?
5) Does anyone know of a better way of extracting modified classes per pull request?
I know this is a long post, but I tried to be specific here. The answer to any of these questions would be very much appreciated.
I am trying to use Python for my Jenkins job. This job downloads and refreshes a line in the project, then commits and creates a pull request. I have been reading the GitPython documentation as hard as I can, but I am not able to make any sense out of it.
import git
import os
import os.path as osp

path = "banana-post/infrastructure/"
repo = git.Repo.clone_from('https://github.myproject.git',
                           osp.join('/Users/monkeyman/PycharmProjects/projectfolder/', 'monkey-post'),
                           branch='banana-refresh')
os.chdir(path)

latest_banana = '123456'
input_file_name = "banana.yml"
output_file_name = "banana.yml"

with open(input_file_name, 'r') as f_in, open(output_file_name, 'w') as f_out:
    for line in f_in:
        if line.startswith("banana_version:"):
            f_out.write("banana_version: {}".format(latest_banana))
            f_out.write("\n")
        else:
            f_out.write(line)

os.remove("deploy.yml")
os.rename("deploy1.yml", "banana.yml")

files = repo.git.diff(None, name_only=True)
for f in files.split('\n'):
    repo.git.add(f)

repo.git.commit('-m', 'This an Auto banana Refresh, contact bannana#monkey.com',
                author='moneky#banana.com')
After committing this change I am trying to push this change and create a pull request from branch='banana-refresh' to branch='banana-integration'.
GitPython is only a wrapper around Git. I assume you want to create a pull request in a Git hosting service (GitHub/GitLab/etc.).
You can't create a pull request using the standard git command line. git request-pull, for example, only generates a summary of pending changes; it doesn't create a pull request in GitHub.
If you want to create a pull request in GitHub, you can use the PyGithub library.
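For instance, a minimal PyGithub sketch (the repository name, branch names, and token are placeholders, mirroring the branches from the question):

from github import Github

gh = Github("<your_git_token>")                    # personal access token
repo = gh.get_repo("<your_project>/<your_repo>")   # "owner/repo"

# Open a pull request of banana-refresh against banana-integration
pr = repo.create_pull(
    title="My pull request title",
    body="My pull request description",
    head="banana-refresh",
    base="banana-integration",
)
print(pr.html_url)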
Or make a simple HTTP request to the Github API with the requests library:
import json
import requests

def create_pull_request(project_name, repo_name, title, description, head_branch, base_branch, git_token):
    """Creates the pull request for the head_branch against the base_branch"""
    git_pulls_api = "https://github.com/api/v3/repos/{0}/{1}/pulls".format(
        project_name,
        repo_name)
    headers = {
        "Authorization": "token {0}".format(git_token),
        "Content-Type": "application/json"}

    payload = {
        "title": title,
        "body": description,
        "head": head_branch,
        "base": base_branch,
    }

    r = requests.post(
        git_pulls_api,
        headers=headers,
        data=json.dumps(payload))

    if not r.ok:
        print("Request Failed: {0}".format(r.text))

create_pull_request(
    "<your_project>",             # project_name
    "<your_repo>",                # repo_name
    "My pull request title",      # title
    "My pull request description",  # description
    "banana-refresh",             # head_branch
    "banana-integration",         # base_branch
    "<your_git_token>",           # git_token
)
This uses the GitHub OAuth2 Token Auth and the GitHub pull request API endpoint to make a pull request of the branch banana-refresh against banana-integration.
It appears as though pull requests have not been wrapped by this library.
You can call the git command line directly as per the documentation.
repo.git.pull_request(...)
I followed frederix's answer, but used https://api.github.com/repos/ instead of https://github.com/api/v3/repos/.
Also, use the repository owner's name (owner_name) instead of project_name.
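In other words, only the endpoint line of the earlier snippet changes; a sketch with placeholder names:

# Public GitHub API endpoint, keyed by the repository owner's name
git_pulls_api = "https://api.github.com/repos/{0}/{1}/pulls".format(
    "<owner_name>",  # repository owner instead of project_name
    "<your_repo>")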