Scrape latitude and longitude locations obtained from Mapbox - python

I'm working on a divvy dataset project.
I want to scrape the information for each suggested location, along with its comments, from http://suggest.divvybikes.com/.
Am I able to scrape this information from Mapbox? It is displayed on a map so it must have the information somewhere.

I visited the page, and logged my network traffic using Google Chrome's Developer Tools. Filtering the requests to view only XHR (XmlHttpRequest) requests, I saw a lot of HTTP GET requests to various REST APIs. These REST APIs return JSON, which is ideal.

Only two of these APIs seem to be relevant for your purposes - one is for places, the other for comments associated with those places. The places API's JSON contains interesting information, such as place ids and coordinates. The comments API's JSON contains all comments regarding a specific place, identified by its id.

Mimicking those calls is pretty straightforward with the third-party requests module. Fortunately, the APIs don't seem to care about request headers. The query-string parameters (the params dictionary) need to be well-formulated though, of course.
I was able to come up with the following two functions: get_places makes multiple calls to the same API, each time with a different page query-string parameter. It seems that "page" is the term they use internally to split up all their data into different chunks - all the different places/features/stations are split up across multiple pages, and you can only get one page per API call. The while-loop accumulates all places in a giant list, and it keeps going until we receive a response which tells us there are no more pages. Once the loop ends, we return the list of places.
The other function is get_comments, which takes a place id (string) as a parameter. It then makes an HTTP GET request to the appropriate API, and returns a list of comments for that place. This list may be empty if there are no comments.
def get_places():
    import requests
    from itertools import count

    api_url = "http://suggest.divvybikes.com/api/places"
    page_counter = count(1)
    places = []

    for page_nr in page_counter:
        params = {
            "page": str(page_nr),
            "include_submissions": "true"
        }

        response = requests.get(api_url, params=params)
        response.raise_for_status()

        content = response.json()
        places.extend(content["features"])

        if content["metadata"]["next"] is None:
            break

    return places
def get_comments(place_id):
    import requests

    api_url = "http://suggest.divvybikes.com/api/places/{}/comments".format(place_id)

    response = requests.get(api_url)
    response.raise_for_status()

    return response.json()["results"]
def main():
    from operator import itemgetter

    places = get_places()

    place_id = places[12]["id"]
    print("Printing comments for the thirteenth place (id: {})\n".format(place_id))

    for comment in map(itemgetter("comment"), get_comments(place_id)):
        print(comment)

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Printing comments for the thirteenth place (id: 107062)
I contacted Divvy about this five years ago and would like to pick the conversation back up! The Evanston Divvy bikes are regularly spotted in Wilmette and we'd love to expand the system for riders. We could easily have four stations - at the Metra Train Station, and the CTA station, at the lakefront Gillson Park and possibly one at Edens Plaza in west Wilmette. Please, please, please contact me directly. Thanks.
For this example, I'm printing all the comments for the 13th place in our list of places. I picked that one because it is the first place which actually has comments (0 - 11 didn't have any comments, most places don't seem to have comments). In this case, this place only had one comment.
EDIT - If you wanted to save the place ids, latitude, longitude and comments in a CSV, you can try changing the main function to:
def main():
    import csv

    print("Getting places...")
    places = get_places()
    print("Got all places.")

    # GeoJSON coordinates are ordered [longitude, latitude]
    fieldnames = ["place id", "longitude", "latitude", "comments"]

    print("Writing to CSV file...")
    with open("output.csv", "w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames)
        writer.writeheader()

        num_places_to_write = 25
        for place_nr, place in enumerate(places[:num_places_to_write], start=1):
            print("Writing place #{}/{} with id {}".format(place_nr, num_places_to_write, place["id"]))
            longitude, latitude = place["geometry"]["coordinates"]
            comments = [c["comment"] for c in get_comments(place["id"])]
            writer.writerow(dict(zip(fieldnames, [place["id"], longitude, latitude, comments])))

    return 0
With this, I got results like:
place id,longitude,latitude,comments
107098,-87.6711076553,41.9718155716,[]
107097,-87.759540081,42.0121073671,[]
107096,-87.747695446,42.0263916146,[]
107090,-87.6642036438,42.0162096564,[]
107089,-87.6609444613,41.8852953922,[]
107083,-87.6007853815,41.8199433342,[]
107082,-87.6355862613,41.8532736671,[]
107075,-87.6210737228,41.8862644836,[]
107074,-87.6210737228,41.8862644836,[]
107073,-87.6210737228,41.8862644836,[]
107065,-87.6499611139,41.9627251578,[]
107064,-87.6136027649,41.8332984674,[]
107062,-87.7073025402,42.0760990584,"[""I contacted Divvy about this five years ago and would like to pick the conversation back up! The Evanston Divvy bikes are regularly spotted in Wilmette and we'd love to expand the system for riders. We could easily have four stations - at the Metra Train Station, and the CTA station, at the lakefront Gillson Park and possibly one at Edens Plaza in west Wilmette. Please, please, please contact me directly. Thanks.""]"
In this case, I used the list-slicing syntax (places[:num_places_to_write]) to only write the first 25 places to the CSV file, just for demonstration purposes. However, after about the first thirteen were written, I got this exception message:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
So, I'm guessing that the comment-API doesn't expect to receive so many requests in such a short amount of time. You may have to sleep in the loop for a bit to get around this. It's also possible that the API doesn't care, and just happened to timeout.
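For example, here is a minimal sketch of throttling the comment requests; the one-second delay is an arbitrary guess on my part, not a documented limit:

import time

def get_all_comments(places, delay_seconds=1.0):
    # Fetch comments for every place, pausing between requests so the API isn't hammered.
    comments_by_id = {}
    for place in places:
        comments_by_id[place["id"]] = get_comments(place["id"])
        time.sleep(delay_seconds)
    return comments_by_id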

Related

Getting the Protein names and their ID for a given list of peptide sequence (using Python)

I have a list of peptide sequences and I want to map them to the correct protein names from any open database like UniProt, i.e., the proteins the peptides belong to. Can someone guide me on how to find the protein names and map them? Thanks in advance.
I'd say your best bet is to use the requests module and hook into the API that Uniprot has on their website. The API for peptide sequence searching is here, and the docs for it link from the same page.
With this, you should be able to form a dict that contains your search parameters and send a request to the API that will return the results you are looking for. The requests module allows you to retrieve the results in JSON format, which you can very easily parse back into lists/dicts, etc. for use in whatever way you wish.
Edit: I have code!
Just for fun, I tried the first part: looking up the proteins using the peptides. This works! You can see how easy the requests module makes this sort of thing :)
There is another API for retrieving the database entries once you have the list of "accessions" from this first step. All of the API end points and docs can be accessed here. I think you want this one.
import requests
from time import sleep

url = 'https://research.bioinformatics.udel.edu/peptidematchws/asyncrest'

# peps can be a comma-separated list for multiple peptide sequences
data = {'peps': 'MKTLLLTLVVVTIVCLDLGYT', 'lEQi': 'off', 'spOnly': 'off'}
headers = {'Content-Type': 'application/x-www-form-urlencoded'}

response = requests.post(url, params=data, headers=headers)

if response.status_code == 202:
    print(f"Search accepted. Results at {response.headers['Location']}")
    search_job = requests.get(response.headers['Location'])
    while search_job.status_code == 303:
        sleep(30)
        search_job = requests.get(response.headers['Location'])
    if search_job.status_code == 200:
        results = search_job.text.split(',')
        print('Results found:')
        print(results)
    else:
        print('No matches found')
else:
    print('Error Search not accepted')
    print(response.status_code, response.reason)
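As for the second step (retrieving the database entries once you have the accessions), here is a minimal sketch. The https://rest.uniprot.org/uniprotkb/{accession} endpoint and the JSON layout are assumptions based on UniProt's current REST API, so verify them against the docs linked above before relying on them:

import requests

def fetch_uniprot_entry(accession):
    # Fetch a single UniProtKB entry as JSON (endpoint assumed; see the UniProt REST docs).
    entry_url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    response = requests.get(entry_url)
    response.raise_for_status()
    return response.json()

# Example usage with the accessions returned by the peptide search above:
# for accession in results:
#     entry = fetch_uniprot_entry(accession)
#     print(accession, entry.get("proteinDescription", {}))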

Web Scraping AccuWeather site

I have recently started learning web scraping using Scrapy in Python and am facing issues with scraping data from the AccuWeather site (https://www.accuweather.com/en/gb/london/ec4a-2/may-weather/328328?year=2020).
Basically, I am capturing dates and their weather temperatures for reporting purposes.
When I inspected the site, I found too many div tags and got confused about how to write the code, so I thought I would seek expert help on this.
Here is my code for your reference.
import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.accuweather.com/en/gb/london/ec4a-2/may-weather/328328?year=2020']

    def parse(self, response):
        All_div_tags = response.css('div.content-module')[0]
        #Grid_tag = All_div_tags.css('div.monthly-grid')
        Date_tag = All_div_tags.css('div.date::text').extract()
        yield {
            'Date': Date_tag}
I wrote this in PyCharm and am getting an error saying "code is not handled or not allowed".
Please could someone help me with this?
I've tried to read some websites that gave me the same error. It happens because some websites don't allow web scraping. To get data from these websites, you would probably need to use their API, if they have one.
Fortunately, AccuWeather has made it easy to use their API (unlike some other services):
You first need to create an account at their developers' website: https://developer.accuweather.com/
Now, create a new app by going to My Apps > Add a new app.
You will probably see some information about your app (if you don't, press its name and it will probably show up). The only information you will need is your API Key, which is essential for APIs.
AccuWeather has pretty good documentation about their API here, but I will show you how to use the most useful endpoints. You will need the location key of the city you want the weather for; it is shown in the URL of the city's weather page. For example, London's URL is www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328, so its location key is 328328.
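If you'd rather pull the location key out of such a URL programmatically, here is a minimal sketch; it assumes the key is always the last path segment, which holds for the URLs above:

def location_key_from_url(url):
    # Return the trailing location key from an AccuWeather city URL,
    # e.g. .../london/ec4a-2/weather-forecast/328328 -> "328328".
    return url.rstrip("/").rsplit("/", 1)[-1]

print(location_key_from_url("https://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328"))  # 328328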
When you have the location key of the city/cities you want to get the weather from, open a file, and type:
import requests
import json
If you want the daily weather (as shown here), type:
response = requests.get(url="http://dataservice.accuweather.com/forecasts/v1/daily/1day/LOCATIONKEY?apikey=APIKEY")
print(response.status_code)
Replace APIKEY with your API key and LOCATIONKEY with the city's location key. When you run it, it should now display 200 (meaning the request was successful).
Now, load it as a JSON file:
response_json = json.loads(response.content)
And you can now get some information from it, such as the day's headline text:
print(response_json["Headline"]["Text"])
The minimum temperature:
min_temperature = response_json["DailyForecasts"][0]["Temperature"]["Minimum"]["Value"]
print(f"Minimum Temperature: {min_temperature}")
The maximum temperature:
max_temperature = response_json["DailyForecasts"][0]["Temperature"]["Maximum"]["Value"]
print(f"Maximum Temperature: {max_temperature}")
The minimum temperature and maximum temperature with the unit:
min_temperature = str(response_json["DailyForecasts"][0]["Temperature"]["Minimum"]["Value"]) + response_json["DailyForecasts"][0]["Temperature"]["Minimum"]["Unit"]
print(f"Minimum Temperature: {min_temperature}")
max_temperature = str(response_json["DailyForecasts"][0]["Temperature"]["Maximum"]["Value"]) + response_json["DailyForecasts"][0]["Temperature"]["Maximum"]["Unit"]
print(f"Maximum Temperature: {max_temperature}")
And more.
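Putting it all together, a minimal end-to-end sketch might look like this (API_KEY and LOCATION_KEY are placeholders you have to fill in yourself):

import requests

API_KEY = "your-api-key"   # from developer.accuweather.com > My Apps
LOCATION_KEY = "328328"    # e.g. London, taken from the city's URL

url = "http://dataservice.accuweather.com/forecasts/v1/daily/1day/" + LOCATION_KEY
response = requests.get(url, params={"apikey": API_KEY})
response.raise_for_status()

forecast = response.json()
day = forecast["DailyForecasts"][0]

print(forecast["Headline"]["Text"])
for bound in ("Minimum", "Maximum"):
    temperature = day["Temperature"][bound]
    print(f"{bound} Temperature: {temperature['Value']}{temperature['Unit']}")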
If you have any questions, let me know. I hope I could help you!

Imgur API - How do I retrieve all favorites without pagination?

According to the Imgur Docs, the "GET Account Favorites" API call takes optional arguments for pagination, implying that all objects are returned without it.
However, when I use the following code snippet (the application has been registered and OAuth has already been performed against my account for testing), I get only the first 30 JSON objects. In the snippet below, I already have an access_token for an authorized user and can retrieve data for that username, but the returned list always contains only the first 30 items.
import requests

username = token['username']
bearer_headers = {
    'Authorization': 'Bearer ' + token['access_token']
}
fav_url = 'https://api.imgur.com/3/account/' + username + '/' + 'favorites'
r = requests.get(fav_url, headers=bearer_headers)
r_json = r.json()
favorites = r_json['data']
print(len(favorites))  # always 30
print(favorites)
The requests response returns a dictionary with three keys: status (the HTTP status code), success (true or false), and data, of which the value is a list of dictionaries (one per favorited item).
I'm trying to retrieve this without pagination so I can extract specific metadata values into a Pandas dataframe (id, post date, etc).
I originally thought this was a Pandas display problem in Jupyter notebook, but tracked it back to the API only returning the newest 30 list items, despite the docs indicating otherwise. If I place an arbitrary page number at the end (eg, "/favorites/1"), it returns the 30 items appropriate to that page, but there doesn't seem to be an option to get all items or retrieve a count of the total items or number of pages in advance.
What am I missing?
Postscript: It appears that none of the URIs work without pagination, eg, get account images, get gallery submissions, etc. Anything where there is an optional "/{{page}}" parameter, it will default to first page if none is specified. So I guess the larger question is, "does Imgur API even support non-paginated data, and how is that accessed?".
Paginated data is usually used when the possible size of the response can be arbitrarily large. I would be surprised if a major service like Imgur had an API that didn't work this way.
As you have found, the page attribute may be optional, and if you don't provide it, you get the first page as your response.
If you want to get more than the first page, you will need to loop over the page number:
data = []
page = 0
while block := connection.get(page=page):
    data.append(block)
    page += 1
This assumes Python 3.8+ due to the := assignment expression. If you are on an older version, you'll need to set block in the loop body, but the same idea applies.
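As a concrete sketch of that idea against the favorites endpoint from the question: the page-in-the-URL pattern comes from the question's own observation that /favorites/1 returns another page, while stopping on an empty page (and whether pages start at 0 or 1) are assumptions you should verify against the Imgur docs:

import requests

def get_all_favorites(username, access_token):
    # Page through /3/account/{username}/favorites/{page} until an empty page comes back.
    headers = {'Authorization': 'Bearer ' + access_token}
    favorites = []
    page = 0  # assumed to be zero-based; adjust if the API counts pages from 1
    while True:
        url = 'https://api.imgur.com/3/account/{}/favorites/{}'.format(username, page)
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        batch = response.json()['data']
        if not batch:  # an empty page means there are no more favorites
            break
        favorites.extend(batch)
        page += 1
    return favorites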

How can I detect the method to request data from this site?

UPDATE: I've put together the following script to use the URL for the XML without the time-code-like suffix, as recommended in the answer below, and to report the downlink powers, which clearly fluctuate on the website. However, I'm getting three-hour-old, unvarying data.
So it looks like I need to properly construct that (time code? authorization? secret password?) in order to do this successfully. Like I say in the comment below, "I don't want to do anything that's not allowed and welcome - NASA has enough challenges already trying to talk to a forty year old spacecraft 20 billion kilometers away!"
def dictify(r, root=True):
    """from: https://stackoverflow.com/a/30923963/3904031"""
    if root:
        return {r.tag: dictify(r, False)}
    d = copy(r.attrib)
    if r.text:
        d["_text"] = r.text
    for x in r.findall("./*"):
        if x.tag not in d:
            d[x.tag] = []
        d[x.tag].append(dictify(x, False))
    return d

import xml.etree.ElementTree as ET
from copy import copy
import urllib2

url = 'https://eyes.nasa.gov/dsn/data/dsn.xml'
contents = urllib2.urlopen(url).read()
root = ET.fromstring(contents)
DSNdict = dictify(root)

dishes = DSNdict['dsn']['dish']

dp_dict = dict()
for dish in dishes:
    powers = [float(sig['power']) for sig in dish['downSignal'] if sig['power']]
    dp_dict[dish['name']] = powers

print dp_dict['DSS26']
I'd like to keep track of which spacecraft that the NASA Deep Space Network (DSN) is communicating with, say once per minute.
I learned how to do something similar from Flight Radar 24 from the answer to my previous question, which also still represents my current skills in getting data from web sites.
With FR24, the explanations in this blog were a great place to start. I opened the page with the Developer Tools in the Chrome browser, and I can see that data for items such as dishes and spacecraft, along with associated numerical data, are requested as XML with URLs such as
https://eyes.nasa.gov/dsn/data/dsn.xml?r=293849023
so it looks like I need to construct the integer (time code? authorization? secret password?) after the r= once a minute.
My Question: Using python, how could I best find out what that integer represents, and how to generate it in order to correctly request data once per minute?
(Above: screenshot montage from NASA's DSN Now page, https://eyes.nasa.gov/dsn/dsn.html; see also this question.)
Using a random number (or a timestamp...) in a GET parameter tricks the browser into really making the request (instead of serving it from the browser cache).
This method is some kind of "hack" the webdevs use so that they are sure the request actually happens.
Since you aren't using a web browser, I'm pretty sure you could totally ignore this parameter, and still get the refreshed data.
--- Edit ---
Actually r seems to be required, and has to be updated.
#!/bin/bash
wget https://eyes.nasa.gov/dsn/data/dsn.xml?r=$(date +%s) -O a.xml -nv
while true; do
    sleep 1
    wget https://eyes.nasa.gov/dsn/data/dsn.xml?r=$(date +%s) -O b.xml -nv
    diff a.xml b.xml
    cp b.xml a.xml -f
done
You don't need to emulate a browser. Simply set r to anything and increment it. (Or use a timestamp)
Regarding your updated question, why avoid sending the r query string parameter when it is very easy to generate it? Also, with the requests module, it's easy to send the parameter with the request too:
import time
import requests
import xml.etree.ElementTree as ET
url = 'https://eyes.nasa.gov/dsn/data/dsn.xml'
r = int(time.time() / 5)
response = requests.get(url, params={'r': r})
root = ET.fromstring(response.content)
# etc....
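To poll once per minute and report which spacecraft each dish is talking to, a minimal sketch could look like the following. The spacecraft attribute on downSignal is an assumption based on the data the question describes, so check it against an actual dsn.xml response:

import time
import requests
import xml.etree.ElementTree as ET

url = 'https://eyes.nasa.gov/dsn/data/dsn.xml'

while True:
    response = requests.get(url, params={'r': int(time.time())})
    root = ET.fromstring(response.content)

    for dish in root.iter('dish'):
        # 'spacecraft' is assumed to be an attribute of downSignal; adjust if needed
        spacecraft = sorted({sig.get('spacecraft') for sig in dish.iter('downSignal') if sig.get('spacecraft')})
        print(dish.get('name'), spacecraft)

    time.sleep(60)  # poll roughly once per minute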

Discogs API => How to retrieve genre?

I've crawled a tracklist of 36,000 songs that have been played on the Danish national radio station P3. I want to do some statistics on how frequently each of the genres has been played within this period, so I figured the Discogs API might help label each track with a genre. However, the documentation for the API doesn't seem to include an example for querying the genre of a particular song.
I have a CSV file with 3 columns: Artist, Title & Test (Test is where I want the API to label each song with the genre).
Here's a sample of the script I've built so far:
import json
import pandas as pd
import requests
import discogs_client

d = discogs_client.Client('ExampleApplication/0.1')
d.set_consumer_key('key-here', 'secret-here')

input = pd.read_csv('Desktop/TEST.csv', encoding='utf-8', error_bad_lines=False)
df = input[['Artist', 'Title', 'Test']]
df.columns = ['Artist', 'Title', 'Test']

for i in range(0, len(list(df.Artist))):
    x = df.Artist[i]
    g = d.artist(x)
    df.Test[i] = str(g)

df.to_csv('Desktop/TEST2.csv', encoding='utf-8', index=False)
This script has worked so far with a dummy file of 3 records, mapping the artist for a given ID. But as soon as the file gets larger (e.g. 2,000 records), it returns an HTTPError when it cannot find the artist.
I have some questions regarding this approach:
1) Would you recommend using the search query function in the API to retrieve a variable such as 'Genre', or do you think it is possible to retrieve the genre with a 'd.' function from the API?
2) Will I need to acquire an API key? I have successfully mapped the 3 records without an API key so far. It looks like the key is free, though.
Here's the guide I have been following:
https://github.com/discogs/discogs_client
And here's the documentation for the API:
https://www.discogs.com/developers/#page:home,header:home-quickstart
Maybe you need to re-read the discogs_client examples; I am not an expert myself, just a newbie trying to use this API.
AFAIK, g = d.artist(x) fails because x must be an integer, not a string.
So you must first do a search, then get the artist id, then call d.artist(artist_id).
Sorry for not providing an example, I am a Python newbie right now ;)
Also, have you checked acoustid?
It's probably a rate limit.
Read the status code of your response; you should find a 429 Too Many Requests.
Unfortunately, if that's the case, the only solution is to add a sleep in your code so you make at most one request per second.
Check out the API docs:
http://www.discogs.com/developers/#page:home,header:home-rate-limiting
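A minimal sketch of that advice, applied to the loop from the question:

import time

for i in range(len(df.Artist)):
    g = d.artist(df.Artist[i])  # or a search call, as suggested in the other answer
    df.Test[i] = str(g)
    time.sleep(1)  # pace requests to roughly one per second to stay under the rate limit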
I found this guide:
https://github.com/neutralino1/discogs_client.
Access the API with your key and try something like:
d = discogs_client.Client('something.py', user_token=auth_token)
release = d.release(774004)
genre = release.genres
If you found a better solution please share.
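For what it's worth, here is a rough sketch that combines the answers above: search for the release by artist and title, take the first result's genres, and pause between requests. The d.search parameters are assumptions about the discogs_client interface, so verify them against the client's documentation:

import time
import discogs_client

d = discogs_client.Client('ExampleApplication/0.1', user_token='token-here')

def lookup_genres(artist, title):
    # Return the genres of the first release matching the artist and title, or [] if none.
    results = d.search(title, artist=artist, type='release')  # parameters assumed; see the client docs
    try:
        release = results[0]
    except IndexError:
        return []
    return release.genres or []

print(lookup_genres('Daft Punk', 'Get Lucky'))
time.sleep(1)  # keep to roughly one request per second to respect the rate limit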
