I'm trying to access Google Maps data from the Google Maps API. This code generates an empty xlsx file instead of returning data. What am I doing wrong?
Thanks in advance for the help.
import googlemaps
import pandas as pd
import time

API_KEY = open('API_KEY.txt', 'r').read()
map_client = googlemaps.Client(API_KEY)

location = (53.53921005946531, -113.50032556307899)
search_string = 'halal restaurants'
distance = 25
edmonton_restaurants = []

response = map_client.places_nearby(
    location=location,
    keyword=search_string,
    name='halal_restaurants',
    radius=distance
)

edmonton_restaurants.extend(response.get('results'))
next_page_token = response.get('next_page_token')

while next_page_token:
    time.sleep(3)
    response = map_client.places_nearby(
        location=location,
        keyword=search_string,
        name='halal_restaurants',
        radius=distance,
        page_token=next_page_token
    )
    edmonton_restaurants.extend(response.get('results'))
    next_page_token = response.get('next_page_token')

df = pd.DataFrame(edmonton_restaurants)
df.to_excel('edmoton_rest.xlsx'.format(search_string), index=False)
You are using the wrong method for this kind of query. places_nearby calls the Nearby Search endpoint, which finds places of a given type around a location; for a free-text search string like 'halal restaurants', the Text Search endpoint, exposed as map_client.places, is the better fit. Note also that radius is given in meters, so distance = 25 only covers 25 m around the point, which by itself can easily return zero results. Both endpoints paginate with next_page_token; you can find the details in the Places API documentation.
The correct code would have the following replacement:
response = map_client.places(
    query=search_string,
    location=location,
    radius=distance
)

edmonton_restaurants.extend(response.get('results'))
next_page_token = response.get('next_page_token')

while next_page_token:
    time.sleep(3)
    response = map_client.places(
        query=search_string,
        location=location,
        radius=distance,
        page_token=next_page_token
    )
    edmonton_restaurants.extend(response.get('results'))
    next_page_token = response.get('next_page_token')

df = pd.DataFrame(edmonton_restaurants)
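After building the DataFrame, the write to Excel stays the same as in the question; for example (the filename below is only illustrative):

df.to_excel('{}.xlsx'.format(search_string), index=False)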
I am using the OpenAlex API (see here for an example) to get all the papers published in 2020.
The way I am doing this is the following:
import requests
import pandas as pd
from collections import defaultdict

# url with a placeholder for cursor
example_url_with_cursor = 'https://api.openalex.org/works?filter=publication_year:2020&cursor={}'

dfs = defaultdict(dict)
paper_id = []
title_lst = []
year_lst = []
lev_lst = []
page = 1
cursor = '*'

# loop through pages
while cursor:
    # set cursor value and request page from OpenAlex
    url = example_url_with_cursor.format(cursor)
    print("\n" + url)
    page_with_results = requests.get(url).json()

    # loop through partial list of results
    results = page_with_results['results']
    for i, work in enumerate(results):
        openalex_id = work['id'].replace("https://openalex.org/", "")
        if work['display_name'] is not None and len(work['display_name']) > 0:
            openalex_title = work['display_name']
        else:
            openalex_title = 'No title'
        # openalex_author = work['authorships'][0]['author']['display_name']
        openalex_year = work['publication_year']
        if work['concepts'] is not None and len(work['concepts']) > 0:
            openalex_lev = work['concepts'][0]['display_name']
        else:
            openalex_lev = 'None'
        paper_id.append(openalex_id)
        title_lst.append(openalex_title)
        year_lst.append(openalex_year)
        lev_lst.append(openalex_lev)
        # print(openalex_id, end='\t' if (i+1)%5!=0 else '\n')

    # Constructing a pandas dataframe:
    df = pd.DataFrame(paper_id, columns=['paper_id'])
    df['title'] = title_lst
    df['level'] = lev_lst
    df['pub_year'] = year_lst

    # update cursor to meta.next_cursor
    # page += 1
    # dfs[page] = df
    cursor = page_with_results['meta']['next_cursor']
This method, however, is very slow, and I would like to speed it up through parallelization. Since I am quite new to parallelizing while loops, I would kindly ask whether there is a way to do so without messing up the final results.
Specifically, the code above loops through the data pages, picks up the cursor of the next page, saves the data in a dataframe (df), and moves on to the next page. The process is repeated until the last page that contains a cursor.
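Cursor pagination itself is inherently sequential, since each cursor value comes from the previous response, so the while loop as written cannot be parallelized directly. One approach that is not part of the post above is to split the publication_year:2020 filter into smaller date ranges and fetch each range's cursor stream in its own worker, then combine the results. A minimal sketch, assuming OpenAlex's from_publication_date / to_publication_date filters and a thread pool (names and ranges here are illustrative):

import calendar
import requests
from concurrent.futures import ThreadPoolExecutor

# One independent cursor stream per date range.
URL = ('https://api.openalex.org/works'
       '?filter=from_publication_date:{},to_publication_date:{}&cursor={}')

def fetch_range(start, end):
    rows, cursor = [], '*'
    while cursor:
        page = requests.get(URL.format(start, end, cursor)).json()
        for work in page['results']:
            rows.append((work['id'], work.get('display_name'), work.get('publication_year')))
        cursor = page['meta']['next_cursor']
    return rows

# e.g. one range per month of 2020
months = [(f'2020-{m:02d}-01', f'2020-{m:02d}-{calendar.monthrange(2020, m)[1]:02d}')
          for m in range(1, 13)]
with ThreadPoolExecutor(max_workers=4) as pool:
    all_rows = [row for chunk in pool.map(lambda r: fetch_range(*r), months) for row in chunk]

Each worker accumulates its own rows, so the results cannot interleave or get lost; the combined list can then be turned into a DataFrame exactly as in the sequential version.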
I am trying to create a function that filters JSON data pulled from the Google Places API. I want it to return the name of a business and the types values if the name contains the string "Body Shop" and the types are ['car_repair', 'point_of_interest', 'establishment']; otherwise I want it to reject the result. Here is my code so far. I have tried and tried and can't seem to figure out a way to store certain criteria to make the search easier.
import googlemaps
import pprint
import time
import urllib.request

API_KEY = 'YOUR_API_KEY'
lat, lng = 40.35003, -111.95206

# define our Client
gmaps = googlemaps.Client(key=API_KEY)

# Define our Search
places_result = gmaps.places_nearby(location="40.35003,-111.95206", radius=40000,
                                    open_now=False, type=['car_repair', 'point_of_interest', 'establishment'])
# pprint.pprint(places_result['results'])

time.sleep(3)
places_result_2 = gmaps.places_nearby(page_token=places_result['next_page_token'])
pprint.pprint(places_result_2['results'])

places_result_2 = gmaps.places_nearby(page_token=places_result['next_page_token'])

types = place_details['result']['types']
name = place_details['result']['name']

def match(types, name):
    for val in types:
        'car_repair', 'point_of_interest', 'establishment' in val and "Body Shop" in name
        print(name, types)
Try this:
import googlemaps
import pprint
import time
import urllib.request

API_KEY = 'YOUR_API_KEY'
lat, lng = 40.35003, -111.95206

# define our Client
gmaps = googlemaps.Client(key=API_KEY)

# Define our Search
places_result = gmaps.places_nearby(location="40.35003,-111.95206", radius=40000,
                                    open_now=False, type=['car_repair', 'point_of_interest', 'establishment'])

# here's how to retrieve the name of the first result
example_of_name = places_result['results'][0]['name']
print(example_of_name)

# gets place name and types for all the results
for place in places_result['results']:
    print("Name of Place:")
    print(place['name'])
    print("Type of the place:")
    print(place['types'], "\n")
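The loop above prints every place; the filtering the question asks for (keep a result only when the name contains "Body Shop" and the types include car_repair, point_of_interest and establishment) is not shown. One possible sketch, reusing the same places_result structure:

WANTED_TYPES = {'car_repair', 'point_of_interest', 'establishment'}

def match(place):
    # keep the place only if its name mentions "Body Shop"
    # and it carries all of the wanted types
    return "Body Shop" in place['name'] and WANTED_TYPES.issubset(place['types'])

for place in places_result['results']:
    if match(place):
        print(place['name'], place['types'])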
I would like to modify the code below to allow for searching multiple stores at once (via the four digit store number in the 'data' section below). What is the best way to accomplish this? Preferably I would be able to limit the search to 50-100 stores.
import requests
import json

js = requests.post("http://www.walmart.com/store/ajax/search",
                   data={"searchQuery": "store=2516&size=18&dept=4044&query=43888060"}).json()
data = json.loads(js['searchResults'])
res = data["results"][0]
print(res["name"], res["inventory"])
I would also like the store # printed in the line above.
Your data object in the requests.post call can be constructed like any other string, and then a variable that represents it can take the place of your "store=2516..." string. Like this, assuming requests is defined in the outer function someplace and can be reused:
var stores = ["2516", "3498", "5478"];
stores.forEach( makeTheCall );

function makeTheCall( element, index, array ) {
    storeQuery = "store=" + element + "&size=18&dept=4044&query=43888060";
    js = requests.post("http://www.walmart.com/store/ajax/search",
                       data={"searchQuery": storeQuery}).json()
    data = json.loads(js['searchResults'])
    res = data["results"][0]
    console.log("name: " + res["name"] + ", store: " + element + ", inventory: " + res["inventory"]);
}
I'm not familiar with your use of "print", but I've only ever used client side javascript.
The API does not support searching for multiple stores, so you need to make multiple requests.
import requests
import json
from collections import defaultdict

results = defaultdict(list)
stores = [2516, 1234, 5678]
url = "http://www.walmart.com/store/ajax/search"
query = "store={}&size=18&dept=4044&query=43888060"

for store in stores:
    r = requests.post(url, data={'searchQuery': query.format(store)})
    r.raise_for_status()
    try:
        data = json.loads(r.json()['searchResults'])['results'][0]
        results[store].append((data['name'], data['inventory']))
    except IndexError:
        continue

for store, data in results.items():  # use iteritems() on Python 2
    print('Store: {}'.format(store))
    if data:
        for name, inventory in data:
            print('\t{} - {}'.format(name, inventory))
Here is a snippet of the code:
import flickrapi

api_key = "xxxxxxxxxxxxxxxxxxxxx"
secret_api_key = "xxxxxxxxxx"

flickr = flickrapi.FlickrAPI(api_key, secret_api_key)

def obtainImages3():
    group_list = flickr.groups.search(api_key=api_key, text='Paris', per_page=10)
    for group in group_list[0]:
        group_images = flickr.groups.pools.getPhotos(api_key=api_key, group_id=group.attrib['nsid'],
                                                     extras='geo, tags, url_s')
        for image in group_images[0]:
            url = image.attrib['url_s']
            tags = image.attrib['tags']
            if image.attrib['geo'] != 'null':
                photo_location = flickr.photos_geo_getLocation(photo_id=image.attrib['id'])
                lat = float(photo_location[0][0].attrib['latitude'])
                lon = float(photo_location[0][0].attrib['longitude'])
I want to get information about images if and only if they have a geo-tag attached to them. I tried to do this with the line if image.attrib['geo'] != 'null', but I don't think this works. Can anyone suggest a way I might be able to do it? Thanks in advance!
Replace your if image.attrib['geo'] != 'null' condition with a try and except block as below.
Since the attributes are exposed as a dictionary, you can check for the presence of the key using:
try:
    image.attrib['geo']
    photo_location = flickr.photos_geo_getLocation(photo_id=image.attrib['id'])
    lat = float(photo_location[0][0].attrib['latitude'])
    lon = float(photo_location[0][0].attrib['longitude'])
except KeyError:
    pass
Is it possible to get the full follower list of an account who has more than one million followers, like McDonald's?
I use Tweepy with the following code:
c = tweepy.Cursor(api.followers_ids, id='McDonalds')
ids = []
for page in c.pages():
    ids.append(page)
I also tried this:
for id in c.items():
    ids.append(id)
But I always got the 'Rate limit exceeded' error and there were only 5000 follower ids.
In order to avoid the rate limit, you can/should wait before the next follower-page request. It looks hacky, but it works:
import time
import tweepy

auth = tweepy.OAuthHandler(..., ...)
auth.set_access_token(..., ...)

api = tweepy.API(auth)

ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="McDonalds").pages():
    ids.extend(page)
    time.sleep(60)

print(len(ids))
Hope that helps.
Use the rate-limiting arguments when making the connection. The API will self-control within the rate limit.
The sleep pause is not bad; I use it to simulate a human and to spread activity out over a time frame, with the API rate limiting as a final control.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)
Also add try/except to capture and control errors; a short sketch follows after the links below.
example code
https://github.com/aspiringguru/twitterDataAnalyse/blob/master/sample_rate_limit_w_cursor.py
I put my keys in an external file to make management easier.
https://github.com/aspiringguru/twitterDataAnalyse/blob/master/keys.py
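For illustration only, here is a minimal sketch of wrapping the cursor loop in try/except. It reuses the api object created above with wait_on_rate_limit=True, uses the older Tweepy exception name consistent with the code in this thread, and is not taken from the linked repository:

import tweepy

ids = []
try:
    for page in tweepy.Cursor(api.followers_ids, screen_name="McDonalds").pages():
        ids.extend(page)
except tweepy.TweepError as err:
    # capture and control the error instead of letting the script die
    print("Twitter API error:", err)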
I use this code and it works for a large number of followers.
There are two functions: one for saving follower ids after every sleep period and another one to get the list.
It is a little messy, but I hope it is useful.
import os
import csv
import time
import numpy as np
import tweepy

# assumes `api` is an authenticated tweepy.API instance, as in the earlier answers

def save_followers_status(filename, foloowersid):
    path = '//content//drive//My Drive//Colab Notebooks//twitter//' + filename
    if not (os.path.isfile(path + '_followers_status.csv')):
        with open(path + '_followers_status.csv', 'wb') as csvfile:
            filewriter = csv.writer(csvfile, delimiter=',')
    if len(foloowersid) > 0:
        print("save followers status of ", filename)
        file = path + '_followers_status.csv'
        # https://stackoverflow.com/questions/3348460/csv-file-written-with-python-has-blank-lines-between-each-row
        with open(file, mode='a', newline='') as csv_file:
            writer = csv.writer(csv_file, delimiter=',')
            for row in foloowersid:
                writer.writerow(np.array(row))
            csv_file.closed

def get_followers_id(person):
    foloowersid = []
    count = 0
    influencer = api.get_user(screen_name=person)
    influencer_id = influencer.id
    number_of_followers = influencer.followers_count
    print("number of followers count : ", number_of_followers, '\n', 'user id : ', influencer_id)
    status = tweepy.Cursor(api.followers_ids, screen_name=person, tweet_mode="extended").items()
    for i in range(0, number_of_followers):
        try:
            user = next(status)
            foloowersid.append([user])
            count += 1
        except tweepy.TweepError:
            print('error limite of twiter sleep for 15 min')
            timestamp = time.strftime("%d.%m.%Y %H:%M:%S", time.localtime())
            print(timestamp)
            if len(foloowersid) > 0:
                print('the number get until this time :', count, 'all folloers count is : ', number_of_followers)
                foloowersid = np.array(str(foloowersid))
                save_followers_status(person, foloowersid)
                foloowersid = []
            time.sleep(15 * 60)
            next(status)
        except:
            print('end of foloowers ', count, 'all followers count is : ', number_of_followers)
            foloowersid = np.array(str(foloowersid))
            save_followers_status(person, foloowersid)
            foloowersid = []
    save_followers_status(person, foloowersid)
    # foloowersid = np.array(map(str,foloowersid))
    return foloowersid
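For example, calling it with the account used in the question (the screen name is just the question's example):

followers = get_followers_id('McDonalds')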
The answer from alecxe is good; however, no one has referred to the docs. The correct information and explanation to answer the question live in the Twitter API documentation. From the documentation:
Results are given in groups of 5,000 user IDs and multiple “pages” of results can be navigated through using the next_cursor value in subsequent requests.
Tweepy's get_follower_ids() uses the https://api.twitter.com/1.1/followers/ids.json endpoint. This endpoint has a rate limit (15 requests per 15 minutes).
You are getting the 'Rate limit exceeded' error because you are crossing that threshold.
Instead of manually putting a sleep in your code, you can use wait_on_rate_limit=True when creating the Tweepy API object.
Moreover, the endpoint has an optional parameter count, which specifies the number of users to return per page. The Twitter API documentation does not say anything about its default value; its maximum value is 5000.
To get the most ids per request, explicitly set it to the maximum so that you need fewer requests.
Here is my code for getting all the followers' ids:
auth = tweepy.OAuth1UserHandler(consumer_key = '', consumer_secret = '',
access_token= '', access_token_secret= '')
api = tweepy.API(auth, wait_on_rate_limit=True)
account_id = 71026122 # instead of account_id you can also use screen_name
follower_ids = []
for page in tweepy.Cursor(api.get_follower_ids, user_id = account_id, count = 5000).pages():
follower_ids.extend(page)