Google Search API - Only returning 4 results - Python

After much experimenting and googling, I got the following Python code to successfully call Google's Search API, but it only returns 4 results. After reading the Google Search API docs, I thought 'start=' would return additional results, but that does not happen.
Can anyone give pointers? Thanks.
Python code:
#!/usr/bin/python
import urllib
import simplejson

query = urllib.urlencode({'q': 'site:example.com'})
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s&start=50' % (query)
search_results = urllib.urlopen(url)
json = simplejson.loads(search_results.read())
results = json['responseData']['results']
for i in results:
    print i['title'] + ": " + i['url']

The start option doesn't give you more results per request; it just moves you forward that many results in the result set. Think of the results as a queue: starting at 50 will give you results 50, 51, 52, and 53.
With this you can collect more results by stepping through the result set four at a time:
import urllib
import simplejson

num_queries = 50 * 4
query = urllib.urlencode({'q': 'example'})
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query

for start in range(0, num_queries, 4):
    request_url = '{0}&start={1}'.format(url, start)
    search_results = urllib.urlopen(request_url)
    json = simplejson.loads(search_results.read())
    results = json['responseData']['results']
    for i in results:
        print i['title'] + ": " + i['url']
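As an aside: if I remember the old AJAX Search API correctly, it also accepted an rsz parameter that raises the page size from the default 4 results to 8, halving the number of requests. Treat rsz=8 as an assumption and check the docs for your API version; a minimal sketch:
import urllib
import simplejson

query = urllib.urlencode({'q': 'example'})
# rsz=8 is assumed to request 8 results per page (the default is 4),
# so the start offset steps by 8 instead of 4.
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&rsz=8&%s' % query

for start in range(0, 64, 8):
    search_results = urllib.urlopen('{0}&start={1}'.format(url, start))
    json = simplejson.loads(search_results.read())
    for i in json['responseData']['results']:
        print i['title'] + ": " + i['url']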

Related

How do I submit a Stack Exchange API query that returns the same results as the basic Stack Overflow search?

I am currently working on a project with the goal of determining the popularity of various topics on gis.stackexchange. I am using Python to interface with the Stack Exchange API. My issue is that I am having trouble configuring the API query to match what a basic search using the search bar would return (posts containing the term x). I am currently using the /search/advanced method with q="term"; however, I am getting empty results for search terms that should have around 100-200 posts. I have read a lot of the API documentation but can't seem to configure the API query to match what a site search would yield.
Edit: For example, if I search "Bayesian", I get 42 results on gis.stackexchange, but when I set q=Bayesian in the API request I get an empty response.
I have included my program below if it helps. Thanks!
#Interfacing_with_SO_API
import requests as rq
import json
import time

keywordinput = input('Enter your search term. If two words, separate by - : ')
baseurl = ('https://api.stackexchange.com/2.3/search/advanced?page=')
endurl = ('&pagesize=100&order=desc&sort=votes&q=' + keywordinput + '&site=gis.stackexchange&filter=!-nt6H9O0imT9xRAnV1gwrp1ZOq7FBaU7CRaGpVkODaQgDIfSY8tJXb')
urltot = ('https://api.stackexchange.com/2.3/search/advanced?page=1&pagesize=100&order=desc&sort=votes&q=' + keywordinput + '&site=gis.stackexchange&filter=!-nt6H9O0imT9xRAnV1gwrp1ZOq7FBaU7CRaGpVkODaQgDIfSY8tJXb')
response = rq.get(urltot)
page = range(1, 400)
if response.status_code == 400:
    print('Initial Response Code 400: Stopping')
    exit()
elif response.status_code == 200:
    print('Initial Response Code 200: Continuing')
datarr = []
for n in page:
    response = rq.get(baseurl + str(n) + endurl)
    print(baseurl + str(n) + endurl)
    time.sleep(2)
    if response.status_code == 400 or response.json()['has_more'] == False or n > 400:
        print('No more pages')
        break
    elif response.json()['has_more'] == True:
        for data in response.json()['items']:
            if data['view_count'] >= 0:
                datarr.append(data)
                print(data['view_count'])
                print(data['answer_count'])
                print(data['score'])

#convert datarr to csv and save to file
with open(input('Search Term Name (filename): ') + '.csv', 'w') as f:
    for data in datarr:
        f.write(str(data['view_count']) + ',' + str(data['answer_count']) + ',' + str(data['score']) + '\n')
exit()
If you look at the results for searching bayesian on the GIS StackExchange site, you'll get 42 results because the StackExchange site search returns both questions and answers that contain the term.
However, the standard /search and /search/advanced API endpoints only search questions, per the doc (emphasis mine):
Searches a site for any questions which fit the given criteria
Instead, what you want to use is the /search/excerpts endpoint, which will return both questions and answers.
Quick demo in the shell to show that it returns the same number of items:
curl -s --compressed "https://api.stackexchange.com/2.3/search/excerpts?page=1&pagesize=100&site=gis&q=bayesian" | jq '.["items"] | length'
42
And a minimal Python program to do the same:
#!/usr/bin/env python3
# file: test_so_search.py
import requests
if __name__ == "__main__":
api_url = "https://api.stackexchange.com/2.3/search/excerpts"
search_term = "bayesian"
qs = {
"page": 1,
"pagesize": 100,
"order": "desc",
"sort": "votes",
"site": "gis",
"q": search_term
}
rsp = requests.get(api_url, qs)
data = rsp.json()
print(f"Got {len(data['items'])} results for '{search_term}'")
And output:
> python test_so_search.py
Got 42 results for 'bayesian'
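If the search term has more hits than one page can hold, the has_more flag the original script already relied on works for /search/excerpts as well. A minimal pagination sketch under that assumption:
import time

import requests

api_url = "https://api.stackexchange.com/2.3/search/excerpts"
items = []
page = 1
while True:
    rsp = requests.get(api_url, {
        "page": page,
        "pagesize": 100,
        "site": "gis",
        "q": "bayesian",
    })
    data = rsp.json()
    items.extend(data["items"])
    if not data.get("has_more"):
        break
    page += 1
    time.sleep(2)  # pause between requests, as the original script did

print(f"Collected {len(items)} items across {page} page(s)")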

Python YouTube Page Token Issue

I'm running into an intermittent issue when I run the code below. I'm trying to collect all the page_tokens in the AJAX calls made by pressing the "Load more" button, if it exists; basically, I'm trying to get all the page tokens from a YouTube channel.
Sometimes it will retrieve the tokens, and other times it doesn't. My best guess is that either I made a mistake in my find_embedded_page_token function, or I need some sort of delay/sleep inserted somewhere.
Below is the full code:
import requests
import pprint
import urllib.parse
import lxml

def find_XSRF_token(html, key, num_chars=2):
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('"', pos_begin)
    return html[pos_begin: pos_end]

def find_page_token(html, key, num_chars=2):
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('&', pos_begin)
    return html[pos_begin: pos_end]

def find_embedded_page_token(html, key, num_chars=2):
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('&', pos_begin)
    excess_str = html[pos_begin: pos_end]
    sep = '\\'
    rest = excess_str.split(sep, 1)[0]
    return rest

sxeVid = 'https://www.youtube.com/user/sxephil/videos'
ajaxStr = 'https://www.youtube.com/browse_ajax?action_continuation=1&continuation='

s = requests.Session()
r = s.get(sxeVid)
html = r.text

session_token = find_XSRF_token(html, 'XSRF_TOKEN', 4)
page_token = find_page_token(html, ';continuation=', 0)
print(page_token)

s = requests.Session()
r = s.get(ajaxStr + page_token)
ajxHtml = r.text
ajax_page_token = find_embedded_page_token(ajxHtml, ';continuation=', 0)

while page_token:
    ajxBtn = ajxHtml.find('data-uix-load-more-href=')
    if ajxBtn != -1:
        s = requests.Session()
        r = s.get(ajaxStr + ajax_page_token)
        ajxHtml = r.text
        ajax_page_token = find_embedded_page_token(ajxHtml, ';continuation=', 0)
        print(ajax_page_token)
    else:
        break
This is what randomly comes back when it fails: it pulls not just the token, but also the HTML after the desired cut-off.
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"><span class="yt-uix-button-content"> <span class="load-more-loading hid">
<span class="yt-spinner">
<span class="yt-spinner-img yt-sprite" title="Loading icon"></span>
The expected response is this:
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5MZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5iZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5yZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk43Z0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk9MZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk9iZ0JBQSUzRCUzRA%253D%253D
Any help is greatly appreciated. Also, if my tags are wrong, let me know which tags to add or remove.
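One possible explanation, judging only from the failing sample above: when the token is embedded in an HTML attribute it is terminated by a double quote rather than by & or a backslash, so slicing up to the next & overshoots. A hedged sketch of find_embedded_page_token that cuts at whichever delimiter comes first (the delimiter set is an assumption based on the sample output):
def find_embedded_page_token(html, key, num_chars=2):
    # Stop at the first of several possible terminators instead of only '&';
    # the delimiter set ('&', '"', '\\') is assumed from the garbled sample.
    pos_begin = html.find(key) + len(key) + num_chars
    positions = [p for p in (html.find(c, pos_begin) for c in ('&', '"', '\\'))
                 if p != -1]
    pos_end = min(positions) if positions else len(html)
    return html[pos_begin:pos_end]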

Python script works but fails after compilation (Windows)

I am working on a script to scrape a website. The problem is that it works normally when I run it with the interpreter, but after compiling it (PyInstaller or py2exe) it fails; it appears that mechanize/requests both fail to keep the session alive.
I have hidden my username and password here, but I did enter them correctly in the compiled code.
import requests
from bs4 import BeautifulSoup as bs
from sys import argv
import re
import logging

url = argv[1]
payload = {"userName": "real_username", "password": "realpassword"}
session = requests.session()
resp = session.post("http://website.net/login.do", data=payload)
if "forgot" in resp.content:
    logging.error("Login failed")
    exit()
resp = session.get(url)
soup = bs(resp.content)
urlM = url[:url.find("?") + 1] + "page=(PLACEHOLDER)&" + \
       url[url.find("?") + 1:]
# Get number of pages
regex = re.compile("\|.*\|\sof\s(\d+)")
script = str(soup.findAll("script")[1])
epNum = int(re.findall(regex, script)[0])  # Number of EPs
pagesNum = epNum // 50
links = []
# Get list of links
# If number of EPs > 50, more than one page
if pagesNum == 0:
    links = [url]
else:
    for i in range(1, pagesNum + 2):
        url = urlM.replace("(PLACEHOLDER)", str(i))
        links.append(url)
# Loop over the links and extract info: ID, NAME, START_DATE, END_DATE
raw_info = []
for pos, link in enumerate(links):
    print "Processing page %d" % (pos + 1)
    sp = bs(session.get(link).content)
    table = sp.table.table
    raw_info.extend(table.findAll("td"))
epURL = "http://www.website.net/exchange/viewep.do?operation"\
        "=executeAction&epId="
# Final data extraction
raw_info = map(str, raw_info)
ids = [re.findall("\d+", i)[0] for i in raw_info[::4]]
names = [re.findall("<td>(.*)</td", i)[0] for i in raw_info[1::4]]
start_dates = [re.findall("<td>(.*)</td", i)[0] for i in raw_info[2::4]]
end_dates = [re.findall("<td>(.*)</td", i)[0] for i in raw_info[3::4]]
emails = []
eplinks = [epURL + str(i) for i in ids]
print names
The error happens at the epNum variable, which tells me (as I figured) that the HTML page is not the one I requested. On Linux it works both as a script and compiled; on Windows it works as a script but fails when compiled.
The py2exe tutorial mentions that you need MSVCR90.dll; did you check that it's present on the PC?
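Independent of the DLL question, it may help to confirm whether the compiled build is actually logging in. A small debugging sketch (the URL, credential placeholders, and the "forgot" marker are taken from the script above) that dumps the post-login HTML to a file for inspection:
# Debugging sketch: save the post-login response so the compiled build
# can be inspected; URL and "forgot" marker come from the original script.
import requests

session = requests.session()
resp = session.post("http://website.net/login.do",
                    data={"userName": "real_username", "password": "realpassword"})
with open("login_debug.html", "w") as f:
    f.write(resp.content)
print "status:", resp.status_code
print "cookies:", session.cookies.get_dict()
print "login failed?", ("forgot" in resp.content)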

request empty result issue

I have this simple Python code, which returns the content of a URL and stores the result as a JSON text file named "file", but it keeps returning an empty result.
What am I doing wrong here? It is just simple code and I am so disappointed.
I have included all the imports needed: import facebook, import requests, and import json.
import requests
import json

url = "https://graph.facebook.com/search?limit=5000&type=page&q=%26&access_token=xx&__after_id=139433456868"
content = requests.get(url).json()
file = open("file.txt", 'w')
file.write(json.dumps(content, indent=1))
file.close()
but it keeps returning an empty result. What am I missing here?
Here is the result:
"data": []
Any help please?
It's working fine:
import urllib2
accesstoken="CAACEdEose0cBACF6HpTDEuVEwVnjx1sHOJFS3ZBQZBsoWqKKyFl93xwZCxKysqsQMUgOZBLjJoMurSxinn96pgbdfSYbyS9Hh3pULdED8Tv255RgnsYmnlxvR7JZCN7X25zP6fRnRK0ZCNmChfLPejaltwM2JGtPGYBQwnmAL9tQBKBmbZAkGYCEQHAbUf7k1YZD"
urllib2.urlopen("https://graph.facebook.com/search?limit=5000&type=page&q=%26&access_token="+accesstoken+"&__after_id=139433456868").read()
I think you have not requested an access token before making the request.
How do you find the access token?
def getSecretToken(verification_code):
    token_url = ("https://graph.facebook.com/oauth/access_token?" +
                 "client_id=" + app_id +
                 "&redirect_uri=" + my_url +
                 "&client_secret=" + app_secret +
                 "&code=" + verification_code)
    response = requests.get(token_url).content
    params = {}
    result = response.split("&", 1)
    print result
    for p in result:
        (k, v) = p.split("=")
        params[k] = v
    return params['access_token']
And how do you get that verification code?
# Inside a request handler (request.query suggests a Bottle-style framework)
verification_code = ""
if "code" in request.query:
    verification_code = request.query["code"]
if not verification_code:
    dialog_url = ("http://www.facebook.com/dialog/oauth?" +
                  "client_id=" + app_id +
                  "&redirect_uri=" + my_url +
                  "&scope=publish_stream")
    return "<script>top.location.href='" + dialog_url + "'</script>"
else:
    access_token = getSecretToken(verification_code)
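Once you have a valid token, a tidier way to build the same Graph API search call is to let requests encode the query string. A sketch; ACCESS_TOKEN is a placeholder, and note that the question's q=%26 is just the URL-encoded form of "&":
import json

import requests

# Sketch: the same search as in the question, with requests doing the
# URL encoding. ACCESS_TOKEN is a placeholder for a real token.
params = {
    "limit": 5000,
    "type": "page",
    "q": "&",  # q=%26 in the original URL is the encoded form of "&"
    "access_token": "ACCESS_TOKEN",
    "__after_id": "139433456868",
}
content = requests.get("https://graph.facebook.com/search", params=params).json()
with open("file.txt", "w") as f:
    f.write(json.dumps(content, indent=1))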

HTTPError: HTTP Error 400: Bad request urllib2

I am a beginner at Python. I am the developer of the Easy APIs Project (http://gcdc2013-easyapisproject.appspot.com) and was doing a Python implementation of the weather API using my project. Visit http://gcdc2013-easyapisproject.appspot.com/APIs_Doc.html to see the Weather API. Below is my implementation, but it returns HTTPError: HTTP Error 400: Bad request.
import urllib2

def celsius(a):
    responsex = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/unitconversion?q=' + a + ' in celsius')
    htmlx = responsex.read()
    responsex.close()
    htmlx = html[1:]  # remove first {
    htmlx = html[:-1]  # remove last }
    htmlx = html.split('}{')  # split and put each result into an array
    return str(htmlx[1])

print "Enter a city name:",
q = raw_input()  # get word from user
response = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/weather?q=' + q)
html = response.read()
response.close()
html = html[1:]  # remove first {
html = html[:-1]  # remove last }
html = html.split('}{')  # split and put each result into an array
print "Today weather is " + html[1]
print "Temperature is " + html[3]
print "Temperature is " + celsius(html[3])
Please help me..
The query string should be quoted using urllib.quote or urllib.quote_plus:
import urllib
import urllib2

def celsius(a):
    responsex = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/unitconversion?q=' + urllib.quote(a + ' in celsius'))
    html = responsex.read()
    responsex.close()
    html = html[1:]  # remove first {
    html = html[:-1]  # remove last }
    html = html.split('}{')  # split and put each result into an array
    return html[0]

print "Enter a city name:",
q = raw_input()  # get word from user
response = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/weather?q=' + urllib.quote(q))
html = response.read()
print repr(html)
response.close()
html = html[1:]  # remove first {
html = html[:-1]  # remove last }
html = html.split('}{')  # split and put each result into an array
print "Today weather is " + html[1]
print "Temperature is " + html[3]
print "Temperature is " + celsius(html[3].split()[0])
In addition to that, I modified celsius to use html consistently; the original code mixed use of html and htmlx.
I have found the answer. The query should be quoted with urllib2.quote(q).
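For anyone on Python 3: quote moved to urllib.parse there. A minimal sketch of the same request against the endpoint above:
# Python 3 sketch of the same request; quote() lives in urllib.parse.
import urllib.parse
import urllib.request

q = input("Enter a city name: ")
url = ('http://gcdc2013-easyapisproject.appspot.com/weather?q='
       + urllib.parse.quote(q))
with urllib.request.urlopen(url) as response:
    html = response.read().decode('utf-8')
print(html)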
