Getting quantity of issues through github.api - python

My task is to get the number of open issues using the GitHub API. Unfortunately, whichever repository I parse, I get the same number: 30.
import requests

r = requests.get('https://api.github.com/repos/grpc/grpc/issues')
count = 0
for item in r.json():
    if item['state'] == 'open':
        count += 1
print(count)
Is there any way to get a real quantity of issues?

See the documentation about the Link response header; you can also pass the state or filter parameters.
https://developer.github.com/v3/guides/traversing-with-pagination/
https://developer.github.com/v3/issues/
You'll have to page through.
http://.../issues?page=1&state=open
http://.../issues?page=2&state=open
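For example, a minimal sketch of that paging loop with requests (not from the original answer; it follows the Link response header mentioned above, which requests exposes as r.links):

import requests

# Sketch: walk the paginated /issues endpoint by following the Link header.
# Note that GitHub's /issues endpoint also returns open pull requests.
url = 'https://api.github.com/repos/grpc/grpc/issues'
params = {'state': 'open', 'per_page': 100}
count = 0
while url:
    r = requests.get(url, params=params)
    r.raise_for_status()
    count += len(r.json())
    url = r.links.get('next', {}).get('url')  # None when there is no next page
    params = None  # the "next" URL already carries the query string
print(count)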

The /issues endpoint is paginated, which means you have to iterate through several pages to get all the issues.
https://api.github.com/repos/grpc/grpc/issues?page=1
https://api.github.com/repos/grpc/grpc/issues?page=2
...
But there is a better way to get what you want: GET /repos/:owner/:repo directly gives the number of open issues on a repository.
For instance, on https://api.github.com/repos/grpc/grpc, you can see:
"open_issues_count": 1052,
Have a look at the documentation for this endpoint.
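A minimal sketch of that approach with requests (not part of the original answer):

import requests

# Read the open issue count straight from the repository object.
# Note: GitHub includes open pull requests in open_issues_count.
r = requests.get('https://api.github.com/repos/grpc/grpc')
r.raise_for_status()
print(r.json()['open_issues_count'])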

Related

Scrape latitude and longitude locations obtained from Mapbox

I'm working on a divvy dataset project.
I want to scrape the information for each suggested location, along with the comments, from http://suggest.divvybikes.com/.
Am I able to scrape this information from Mapbox? It is displayed on a map so it must have the information somewhere.
I visited the page, and logged my network traffic using Google Chrome's Developer Tools. Filtering the requests to view only XHR (XmlHttpRequest) requests, I saw a lot of HTTP GET requests to various REST APIs. These REST APIs return JSON, which is ideal. Only two of these APIs seem to be relevant for your purposes - one is for places, the other for comments associated with those places. The places API's JSON contains interesting information, such as place ids and coordinates. The comments API's JSON contains all comments regarding a specific place, identified by its id. Mimicking those calls is pretty straightforward with the third-party requests module. Fortunately, the APIs don't seem to care about request headers. The query-string parameters (the params dictionary) need to be well-formulated though, of course.
I was able to come up with the following two functions: get_places makes multiple calls to the same API, each time with a different page query-string parameter. It seems that "page" is the term they use internally to split up all their data into different chunks - all the different places/features/stations are split up across multiple pages, and you can only get one page per API call. The while-loop accumulates all places in a giant list, and it keeps going until we receive a response which tells us there are no more pages. Once the loop ends, we return the list of places.
The other function is get_comments, which takes a place id (string) as a parameter. It then makes an HTTP GET request to the appropriate API, and returns a list of comments for that place. This list may be empty if there are no comments.
def get_places():
    import requests
    from itertools import count

    api_url = "http://suggest.divvybikes.com/api/places"
    page_counter = count(1)
    places = []

    for page_nr in page_counter:
        params = {
            "page": str(page_nr),
            "include_submissions": "true"
        }
        response = requests.get(api_url, params=params)
        response.raise_for_status()
        content = response.json()
        places.extend(content["features"])
        if content["metadata"]["next"] is None:
            break

    return places


def get_comments(place_id):
    import requests

    api_url = "http://suggest.divvybikes.com/api/places/{}/comments".format(place_id)
    response = requests.get(api_url)
    response.raise_for_status()
    return response.json()["results"]


def main():
    from operator import itemgetter

    places = get_places()
    place_id = places[12]["id"]
    print("Printing comments for the thirteenth place (id: {})\n".format(place_id))
    for comment in map(itemgetter("comment"), get_comments(place_id)):
        print(comment)
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Printing comments for the thirteenth place (id: 107062)
I contacted Divvy about this five years ago and would like to pick the conversation back up! The Evanston Divvy bikes are regularly spotted in Wilmette and we'd love to expand the system for riders. We could easily have four stations - at the Metra Train Station, and the CTA station, at the lakefront Gillson Park and possibly one at Edens Plaza in west Wilmette. Please, please, please contact me directly. Thanks.
>>>
For this example, I'm printing all the comments for the 13th place in our list of places. I picked that one because it is the first place that actually has comments (places 0-11 didn't have any; most places don't seem to have comments). In this case, the place only had one comment.
EDIT - If you wanted to save the place ids, latitude, longitude and comments in a CSV, you can try changing the main function to:
def main():
    import csv

    print("Getting places...")
    places = get_places()
    print("Got all places.")

    fieldnames = ["place id", "latitude", "longitude", "comments"]

    print("Writing to CSV file...")
    with open("output.csv", "w") as file:
        writer = csv.DictWriter(file, fieldnames)
        writer.writeheader()

        num_places_to_write = 25
        for place_nr, place in enumerate(places[:num_places_to_write], start=1):
            print("Writing place #{}/{} with id {}".format(place_nr, num_places_to_write, place["id"]))
            writer.writerow(dict(zip(fieldnames, [place["id"], *place["geometry"]["coordinates"], [c["comment"] for c in get_comments(place["id"])]])))

    return 0
With this, I got results like:
place id,latitude,longitude,comments
107098,-87.6711076553,41.9718155716,[]
107097,-87.759540081,42.0121073671,[]
107096,-87.747695446,42.0263916146,[]
107090,-87.6642036438,42.0162096564,[]
107089,-87.6609444613,41.8852953922,[]
107083,-87.6007853815,41.8199433342,[]
107082,-87.6355862613,41.8532736671,[]
107075,-87.6210737228,41.8862644836,[]
107074,-87.6210737228,41.8862644836,[]
107073,-87.6210737228,41.8862644836,[]
107065,-87.6499611139,41.9627251578,[]
107064,-87.6136027649,41.8332984674,[]
107062,-87.7073025402,42.0760990584,"[""I contacted Divvy about this five years ago and would like to pick the conversation back up! The Evanston Divvy bikes are regularly spotted in Wilmette and we'd love to expand the system for riders. We could easily have four stations - at the Metra Train Station, and the CTA station, at the lakefront Gillson Park and possibly one at Edens Plaza in west Wilmette. Please, please, please contact me directly. Thanks.""]"
In this case, I used the list-slicing syntax (places[:num_places_to_write]) to only write the first 25 places to the CSV file, just for demonstration purposes. However, after about the first thirteen were written, I got this exception message:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
So, I'm guessing that the comment-API doesn't expect to receive so many requests in such a short amount of time. You may have to sleep in the loop for a bit to get around this. It's also possible that the API doesn't care, and just happened to timeout.
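If you hit that, one possible sketch of the sleep-and-retry idea (not from the original answer; get_comments_with_retry is a hypothetical helper, and the retry count and delay are arbitrary choices):

import time
import requests

def get_comments_with_retry(place_id, retries=3, delay=2.0):
    # Retry get_comments a few times, pausing between attempts, in case the
    # comments API times out when hit with many requests in quick succession.
    for attempt in range(retries):
        try:
            return get_comments(place_id)
        except requests.exceptions.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(delay)

You would then call get_comments_with_retry(place["id"]) inside the CSV loop instead of get_comments(place["id"]).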

Forex-python "Currency Rates Source Not Ready"

I want to use the Forex-python module to convert amounts in various currencies to a specific currency ("DKK") according to a specific date (The last day of a previous month according to a date in the dataframe)
This is the structure of my code:
df = pd.DataFrame(data={'Date': ['2017-4-15', '2017-6-12', '2017-2-25'],
                        'Amount': [5, 10, 15],
                        'Currency': ['USD', 'SEK', 'EUR']})

def convert_rates(amount, currency, PstngDate):
    PstngDate = datetime.strptime(PstngDate, '%Y-%m-%d')
    if currency != 'DKK':
        return c.convert(base_cur=currency, dest_cur='DKK', amount=amount,
                         date_obj=PstngDate - timedelta(PstngDate.day))
    else:
        return amount
and then the new column with the converted amount:
df['Amount, DKK'] = np.vectorize(convert_rates)(
    amount=df['Amount'],
    currency=df['Currency'],
    PstngDate=df['Date']
)
I get the RatesNotAvailableError "Currency Rates Source Not Ready"
Any idea what can cause this? It has previously worked with small amounts of data, but I have many rows in my real df...
I inserted a small print statement into convert.py (part of forex-python) to debug this.
print(response.status_code)
Currently I receive:
502
Read these threads about the HTTP 502 error:
In HTTP 502, what is meant by an invalid response?
https://www.lifewire.com/502-bad-gateway-error-explained-2622939
These errors are completely independent of your particular setup,
meaning that you could see one in any browser, on any operating
system, and on any device.
502 indicates that currently there is a problem with the infrastructure this API uses to provide us with the required data. As I am in need of the data myself I will continue to monitor this issue and keep my post on this site updated.
There is already an issue on GitHub about this:
https://github.com/MicroPyramid/forex-python/issues/100
From the source: https://github.com/MicroPyramid/forex-python/blob/80290a2b9150515e15139e1a069f74d220c6b67e/forex_python/converter.py#L73
Your error means the library received a non-200 response code to your request. This could mean the site is down, or that it has blocked you momentarily because you're hammering it with requests.
Try replacing the call to c.convert with something like:
from time import sleep

def try_convert(amount, currency, PstngDate):
    success = False
    while not success:
        try:
            res = c.convert(base_cur=currency, dest_cur='DKK', amount=amount,
                            date_obj=PstngDate - timedelta(PstngDate.day))
            success = True  # stop retrying once the conversion succeeds
        except:
            # wait a while before retrying
            sleep(10)
    return res
Or even better, use a library like backoff to do the retrying for you:
https://pypi.python.org/pypi/backoff/1.3.1
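For example, a sketch using backoff's on_exception decorator (not from the original answer; the retry parameters are arbitrary, c and timedelta are the objects already used in the question, and RatesNotAvailableError is the exception the question reports):

import backoff
from forex_python.converter import RatesNotAvailableError

@backoff.on_exception(backoff.expo, RatesNotAvailableError, max_tries=8)
def convert_with_retry(amount, currency, PstngDate):
    # Same conversion as above, retried with exponential backoff whenever
    # the rates source is not ready.
    return c.convert(base_cur=currency, dest_cur='DKK', amount=amount,
                     date_obj=PstngDate - timedelta(PstngDate.day))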

Soundcloud API python issues with linked partitioning

On the Soundcloud API guide (https://developers.soundcloud.com/docs/api/guide#pagination)
the example given for reading more than 100 pieces of data is as follows:
# get first 100 tracks
tracks = client.get('/tracks', order='created_at', limit=page_size)
for track in tracks:
    print track.title

# start paging through results, 100 at a time
tracks = client.get('/tracks', order='created_at', limit=page_size,
                    linked_partitioning=1)
for track in tracks:
    print track.title
I'm pretty certain this is wrong as I found that 'tracks.collection' needs referencing rather than just 'tracks'. Based on the GitHub python soundcloud API wiki it should look more like this:
tracks = client.get('/tracks', order='created_at', limit=10, linked_partitioning=1)
while tracks.collection != None:
    for track in tracks.collection:
        print(track.playback_count)
    tracks = tracks.GetNextPartition()
Here I have removed the indent from the last line (I think there is an error on the wiki: there it sits inside the for loop, which makes no sense to me). This works for the first loop. However, it doesn't work for successive pages because the GetNextPartition() function is not found. I've tried the last line as:
tracks = tracks.collection.GetNextPartition()
...but no success.
Maybe I'm getting versions mixed up? But I'm trying to run this with Python 3.4 after downloading the version from here: https://github.com/soundcloud/soundcloud-python
Any help much appreciated!
For anyone that cares, I found this solution on the SoundCloud developer forum. It is slightly modified from the original case (searching for tracks) to list my own followers. The trick is to call the client.get function repeatedly, passing the previously returned "users.next_href" as the request that points to the next page of results. Hooray!
pgsize = 200
c = 1
me = client.get('/me')

# first call to get a page of followers
users = client.get('/users/%d/followers' % me.id, limit=pgsize, order='id',
                   linked_partitioning=1)
for user in users.collection:
    print(c, user.username)
    c = c + 1

# linked_partitioning means .next_href exists
while users.next_href != None:
    # pass the contents of users.next_href that contains 'cursor=' to
    # locate the next page of results
    users = client.get(users.next_href, limit=pgsize, order='id',
                       linked_partitioning=1)
    for user in users.collection:
        print(c, user.username)
        c = c + 1

SoundCloud API - Playback Count is smaller than actual count

I am using the SoundCloud API through the Python SDK.
When I get track data through search, the track attribute playback_count seems to be smaller than the actual count shown on the web.
How can I avoid this problem and get the actual playback_count?
(For example, this track's playback_count gives me 2700, but it's actually 15k when displayed on the web: https://soundcloud.com/drumandbassarena/ltj-bukem-soundcrash-mix-march-2016)
Note: this problem does not occur for comments or likes.
The following is my code:
## Search ##
tracks = client.get('/tracks', q=querytext, created_at={'from': startdate},
                    duration={'from': startdur}, limit=200)

outputlist = []
trackinfo = {}
resultnum = 0

for t in tracks:
    trackinfo = {}
    resultnum += 1
    trackinfo["id"] = resultnum
    trackinfo["title"] = t.title
    trackinfo["username"] = t.user["username"]
    trackinfo["created_at"] = t.created_at[:-5]
    trackinfo["genre"] = t.genre
    trackinfo["plays"] = t.playback_count
    trackinfo["comments"] = t.comment_count
    trackinfo["likes"] = t.likes_count
    trackinfo["url"] = t.permalink_url
    outputlist.append(trackinfo)
There is an issue with the playback count being incorrect when reported via the API.
I have encountered this when getting data via the /me endpoint for activities and likes, to mention a couple of cases.
(The original post included two screenshots: one showing the information returned for the currently playing track in the SoundCloud widget, and one showing the information returned via the API for the me/activities endpoint.)
Looking at the SoundCloud website, they actually call a second version of the API to populate the track list on the user page. It's similar to the documented version, but not quite the same.
If you issue a request to https://api-v2.soundcloud.com/stream/users/[userid]?limit=20&client_id=[clientid] then you'll get back a JSON object showing the same numbers you see on the web.
Since this is an undocumented version, I'm sure it'll change the next time they update their website.
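A rough sketch of that call with requests (not from the original answer; user_id and client_id are placeholders you must supply, and since the endpoint is undocumented the response shape may change at any time):

import requests

# Undocumented api-v2 endpoint mentioned above; inspect the JSON to find the
# play counts that match the numbers shown on the website.
user_id = 123456789           # placeholder: the SoundCloud user id
client_id = "YOUR_CLIENT_ID"  # placeholder: your application's client id

url = "https://api-v2.soundcloud.com/stream/users/{}".format(user_id)
resp = requests.get(url, params={"limit": 20, "client_id": client_id})
resp.raise_for_status()
print(resp.json())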

how to find the last available url which does not return 302 (Redirect) status code in a url list quickly

Now I am facing a problem like this:
Say I have a list of urls, e.g.
['http://example.com/1',
'http://example.com/2',
'http://example.com/3',
'http://example.com/4',
...,
'http://example.com/100']
And some of them are unavailable urls, requesting for those urls will result in 302 redirect status code. e.g. .../1 - .../50 are available urls, but .../51 will cause 302. Then .../50 is the url I want.
I want to find out which url is the last available url (one that does not return a 302 code). I believe binary search will do the job, but I wonder how to implement it with better efficiency. I use Python's urllib2 to detect the 302 status code.
This answer makes the assumption that your URLs are currently ordered in a meaningful way, and that all URLs up to some value n will be available and all URLs after n will result in a 302.
If this is the case, then you can adapt this binary search answer to fit your needs:
import requests

def binary_search_urls(urls, lo=0, hi=None):
    if hi is None:
        hi = len(urls)
    while lo < hi:
        mid = (lo + hi) // 2
        status = requests.head(urls[mid]).status_code
        if status != 302:
            lo = mid + 1
        else:
            hi = mid
    return lo - 1
This will give you the index of the last good URL, or -1 if there are no good URLs.
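A quick usage sketch (not from the original answer), using the example list from the question:

urls = ['http://example.com/{}'.format(i) for i in range(1, 101)]
last_good = binary_search_urls(urls)
if last_good >= 0:
    print("Last URL that does not redirect:", urls[last_good])
else:
    print("No non-redirecting URL found.")

Note that requests.head does not follow redirects by default, which is what allows the 302 check above to see the redirect status at all.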
I would just check the entire lot; however, I would use requests instead of urllib2 and make sure to only make HEAD requests to keep bandwidth down (which is possibly going to be your greatest bottleneck anyway).
import requests
urls = [...]
results = [(url, requests.head(url).status_code) for url in urls]
Then go from there...
I don't see how a binary search could be at all faster than straight in-order iteration, and in most cases it would end up being slower. Given n is the length of the list, if you are searching for the last good url of the first good batch, only the case where urls[n/2 - 1] is your target would take the same number of searches as brute-force iteration; all others would take more. If you are looking for the last good url in the entire list, the only search target that would take the same number of searches compared to a reversed-order iteration would be urls[n/2 - 1]. Binary search is only faster when your dataset is ordered. For an unordered dataset, sampling the middle of the set tells you nothing about being able to exclude values to either side of it, so you still have to process the entire sequence before you can tell anything.
I suspect what you may really be wanting here is a way to sample your dataset at intervals so that you can run fewer requests before finding your target, which isn't quite the same as a binary search. Binary search relies on the fact that sampling a point in your sequence provides information on being able to exclude either one side or the other of the sequence from subsequent searches based upon a binary condition. What you have is a system where if a sample fails the test, you can exclude one side, but if it passes the test, it tells you nothing about what you can assume about any other values in the list. That doesn't really work for a binary search.
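If interval sampling is what you are after, one possible sketch of that idea (not from the original answer; the stride is an arbitrary choice, and it still leans on the "all good URLs come first" assumption from the question):

import requests

def last_good_url_by_sampling(urls, stride=10):
    # Probe every `stride`-th URL; once a probe redirects (302), fall back to a
    # linear scan of the block between the last good probe and the bad one.
    last_good = -1
    for i in range(0, len(urls), stride):
        if requests.head(urls[i]).status_code == 302:
            break
        last_good = i
    for j in range(last_good + 1, min(last_good + stride, len(urls))):
        if requests.head(urls[j]).status_code == 302:
            break
        last_good = j
    return last_good  # index of the last non-redirecting URL, or -1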
