I'm a newbie to Python, HTML, and CSS, and I'm trying to reverse engineer "https://github.com/scraperwiki/google-search-python" to learn all three and to use the Google Custom Search API to search the web from Python. Specifically, I want to query the custom search engine I created at "https://cse.google.com/cse/publicurl?cx=000839040200690289140:u2lurwk5tko". I looked through the code, made some minor adjustments, and came up with the following "Search.py":
import os
from google_search import GoogleCustomSearch
#This is for the traceback
import traceback
import sys

#set variables
os.environ["SEARCH_ENGINE_ID"] = "000839... "
os.environ["GOOGLE_CLOUD_API_KEY"] = "AIza... "

SEARCH_ENGINE_ID = os.environ['SEARCH_ENGINE_ID']
API_KEY = os.environ['GOOGLE_CLOUD_API_KEY']

api = GoogleCustomSearch(SEARCH_ENGINE_ID, API_KEY)

print("we got here\n")

#for result in api.search('prayer', 'https://cse.google.com/cse/publicurl?cx=000839040200690289140:u2lurwk5tko'):
for result in api.search('pdf', 'http://scraperwiki.com'):
    print(result['title'])
    print(result['link'])
    print(result['snippet'])

print(traceback.format_exc())
And the import ("At least the relevant parts") I believe comes from the following code in google_search.py
class GoogleCustomSearch(object):
    def __init__(self, search_engine_id, api_key):
        self.search_engine_id = search_engine_id
        self.api_key = api_key

    def search(self, keyword, site=None, max_results=100):
        assert isinstance(keyword, basestring)

        for start_index in range(1, max_results, 10):  # 10 is max page size
            url = self._make_url(start_index, keyword, site)
            logging.info(url)

            response = requests.get(url)
            if response.status_code == 403:
                LOG.info(response.content)
            response.raise_for_status()

            for search_result in _decode_response(response.content):
                yield search_result

                if 'nextPage' not in search_result['meta']['queries']:
                    print("No more pages...")
                    return
However, when I run it, I get the following.
So, here's my problem: I can't quite figure out why the following lines of code don't print to the terminal. What am I overlooking?
print(result['title'])
print(result['link'])
print(result['snippet'])
The only thing I can think of is that I didn't use a correct ID or key somewhere. I created a Google Custom Search engine and a project on the Google Developers Console as the quick start suggested; that is where I got my SEARCH_ENGINE_ID and GOOGLE_CLOUD_API_KEY from.
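For reference, here is a minimal sketch (the key and cx values are just placeholders for the ones above) of calling the Custom Search JSON API directly with requests, to check whether the credentials are accepted at all:

import requests

# Placeholder credentials -- substitute the real values used above.
API_KEY = "AIza... "
SEARCH_ENGINE_ID = "000839... "

params = {"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": "pdf"}
response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
print(response.status_code)            # 200 means the credentials were accepted
data = response.json()
for item in data.get("items", []):     # 'items' is absent when there are no results
    print(item["title"], item["link"])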
After I added the stack trace suggested in the comments, I got this:
Am I just misunderstanding the code, or is there something else I'm missing? I really appreciate any clues that will help me solve this problem, I'm kind of stumped right now.
Thanks in advance guys!
Related
I am trying to download books from "http://www.gutenberg.org/", and I want to know why my code gets nothing.
import requests
import re
import os
import urllib

def get_response(url):
    response = requests.get(url).text
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
    return re.findall(reg, html)

def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)

def download_book(book_url, path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path)  # my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url, path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0], book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing, no tracebacks.
Well, there's no chance you get a traceback in the case of an exception in download_book(), since you explicitly silence them:
try:
    download_book(book_url[0], book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {} : got error {}".format(book_url[0], e))
    continue
or just don't catch the exception at all (at least until you know what to expect and how to handle it).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive Python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the built-in step debugger (pdb); a minimal sketch follows this list
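For example, a minimal pdb sketch (my own illustration, reusing the question's helpers), pausing right before the loop that seems to do nothing:

import pdb

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    pdb.set_trace()  # execution pauses here; type `p content` to inspect it, `c` to continue
    for i in content:
        book_url = get_book_url(i)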
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact that you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ status as well).
2/ you use regexps to parse HTML. DON'T - regexps cannot reliably parse HTML; you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also, some of your regexps look quite wrong (greedy match-all etc.). There is a sketch right after this list.
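As a rough sketch (my own, assuming nothing about the page beyond the URL in the question), the link extraction could start like this with BeautifulSoup:

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
resp = requests.get(start_url)
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of scraping an error page

soup = BeautifulSoup(resp.text, 'html.parser')
for a in soup.find_all('a', href=True):  # every anchor that has an href attribute
    print(a.get_text(strip=True), a['href'])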
start_url is not defined in main()
You need to use a global variable; otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error: start_url is not defined.
def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Never mind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup (from bs4 import BeautifulSoup). For information on why you shouldn't parse HTML with regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial URL has a newline between </h2> and <ul>; try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error; I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0], book_name[0])
            except:
                continue
My goal is to connect to the YouTube API and download the URLs of videos by specific music producers. I found the following script in this video: https://www.youtube.com/watch?v=_M_wle0Iq9M. In the video the code works beautifully, but when I try it on Python 2.7 it gives me KeyError: 'items'.
I know KeyErrors can occur when there is an incorrect use of a dictionary or when a key doesn't exist.
I have tried going to the Google Developers site for YouTube to make sure that 'items' exists, and it does.
I am also aware that using get() may be helpful for my problem, but I am not sure. Any suggestions for fixing my KeyError using the following code, or any suggestions on how to improve my code to reach my main goal of downloading the URLs? (I have a YouTube API key.)
Here is the code:
#these modules help with HTTP requests to YouTube
import urllib
import urllib2
import json

API_KEY = open("/Users/ereyes/Desktop/APIKey.rtf", "r")
API_KEY = API_KEY.read()

searchTerm = raw_input('Search for a video:')
searchTerm = urllib.quote_plus(searchTerm)

url = 'https://www.googleapis.com/youtube/v3/search?part=snippet&q=' + searchTerm + '&key=' + API_KEY
response = urllib.urlopen(url)
videos = json.load(response)

videoMetadata = []  # declaring our list

for video in videos['items']:  # loop through the items in the json response
    if video['id']['kind'] == 'youtube#video':  # make sure the item we are looking at is a video
        videoMetadata.append(video['snippet']['title'] +  # get the title of the video and add it to the list
                             "\nhttp://youtube.com/watch?v=" + video['id']['videoId'])

videoMetadata.sort()  # sorts our list alphabetically

print("\nSearch Results:\n")  # print out search results
for metadata in videoMetadata:
    print(metadata + "\n")

raw_input('Press Enter to Exit')
The problem is most likely a combination of using an RTF file instead of a plain text file for the API key, and confusion over whether to use urllib or urllib2, since you imported both.
Personally, I would recommend requests, but with urllib I think you need to read() the contents of the response to get a string:
response = urllib.urlopen(url).read()
You can check that by printing the response variable.
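For what it's worth, here is a rough sketch (my addition, not tested against your setup) of the same search done with requests, reading the key from a plain text file and using .get('items', []) so a missing 'items' key no longer raises a KeyError; the file path and query term are just assumed examples:

import requests

# Assumed plain-text file that contains nothing but the API key.
with open("/Users/ereyes/Desktop/apikey.txt") as f:
    api_key = f.read().strip()

params = {"part": "snippet", "q": "deadmau5", "key": api_key}
resp = requests.get("https://www.googleapis.com/youtube/v3/search", params=params)
data = resp.json()

if "items" not in data:
    print(data.get("error"))               # the API reports problems under 'error'
for item in data.get("items", []):
    if item["id"]["kind"] == "youtube#video":
        print(item["snippet"]["title"])
        print("http://youtube.com/watch?v=" + item["id"]["videoId"])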
Forgive me if I come straight out with it, but Python drives me nuts with something that seemed to be quite simple.
In a nutshell
I'm writing an extension for a music video scraper which is responsible for getting the fanart backdrop.
Here is the URL:
github.com/MViDLibraryToolKit/.../APICaller
So I was able to call the Fanart.tv API and receive the right JSON response. My problem is that I'm too dumb to collect the URLs under the element "artistbackground".
I searched the internet and found a very similar post here on Stack Overflow, but unluckily it concerned Python 2, API v2, and a different category at fanart.tv, so I was not able to make use of it. Here it was.
Anyway, here is my poor try at collecting the URLs into a list:
# --------------------- Response processing
# output for debugging
# print(fanartTVresp)
# http://webservice.fanart.tv/v3/music/albums/ba853904-ae25-4ebb-89d6-c44cfbd71bd2?api_key=fdadba00cfaaf3621eaa748669256a9e&client_key=dce01d75553d7e3fbc2ad742aaf5d371

# list to be filled
url_list = []

# load the web response
json_response = json.loads(fanartTVresp)

# loop through the artistbackground element
for artistbackground in json_response:
    url = urllib.parse.quote(['url'], ':/')
    if url:
        url_list.append(url)

print(url_list)
The libs I loaded...
import musicbrainzngs
import urllib
import json
import socket
from pprint import pprint
from urllib.parse import quote
The rest of the code you can find at my GitHub link. Please help me, it drives me crazy ^^
Kind regards
P.S. Please excuse my English, I come from Germany :)
I think I finally got it.
# URL list for background images
url_list = []

# set only for debug / value comes from the PowerShell runtime later
location = os.path.abspath('C:/temp')

# decode json
json_response = json.loads(fanartTVresp.decode())

# set objects
bgitem = json_response["artistbackground"]
bgcoverurl = json_response["artistbackground"][0]["url"]

# iterate over the items and collect them
for bgcoverurl in bgitem:
    url_list.append(bgcoverurl)

print(url_list)
After getting some hours of sleep I realized that json.loads deserializes the response into regular Python objects. Correct me if I'm wrong.
Anyway, it finally works!
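That understanding is right: json.loads turns JSON objects into dicts and JSON arrays into lists, so the fields can be accessed like any other Python data. A tiny illustration (my own, with a made-up miniature of the response):

import json

sample = '{"artistbackground": [{"id": "1", "url": "http://example.com/a.jpg"}]}'
data = json.loads(sample)

print(type(data))                      # <class 'dict'>
print(type(data["artistbackground"]))  # <class 'list'>
print([item["url"] for item in data["artistbackground"]])  # just the URL strings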
So I'm trying to create a Python script that will take a search term or query, then search Google for that term. It should then return 5 URLs from the results for that search term.
I spent many hours trying to get PyGoogle to work, but I later found out Google no longer supports the SOAP API for search, nor do they provide new license keys. In a nutshell, PyGoogle is pretty much dead at this point.
So my question here is... What would be the most compact/simple way of doing this?
I would like to do this entirely in Python.
Thanks for any help
Use BeautifulSoup and requests to get the links from the Google search results:
import requests
from bs4 import BeautifulSoup
keyword = "Facebook" #enter your keyword here
search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + keyword
r = requests.get(search)
soup = BeautifulSoup(r.text, "html.parser")
container = soup.find('div',{'id':'search'})
url = container.find("cite").text
print(url)
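Since the question asks for five URLs, here is a rough extension of the same idea (my sketch, not part of the original answer); Google's result markup changes often, so treat the selectors as best-effort guesses rather than a stable API:

import requests
from bs4 import BeautifulSoup

def google_urls(keyword, count=5):
    search = "https://www.google.co.uk/search?q=" + keyword
    r = requests.get(search)
    soup = BeautifulSoup(r.text, "html.parser")
    container = soup.find('div', {'id': 'search'})
    if container is None:                     # markup changed or the request was blocked
        return []
    return [cite.text for cite in container.find_all("cite")[:count]]

print(google_urls("Facebook"))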
What issues are you having with pygoogle? I know it is no longer supported, but I've utilized that project on many occasions and it would work fine for the menial task you have described.
Your question did make me curious though--so I went to Google and typed "python google search". Bam, found this repository. Installed with pip and within 5 minutes of browsing their documentation got what you asked:
import google

for url in google.search("red sox", num=5, stop=1):
    print(url)
Maybe try a little harder next time, ok?
Alternatively, there is the xgoogle library to do the same.
I tried something similar to get the top 10 links, which also counts occurrences of the word we are targeting in each linked page. I have added the code snippet for your reference:
import operator
import urllib
#This line will import the GoogleSearch, SearchError classes from xgoogle/search.py
from xgoogle.search import GoogleSearch, SearchError

my_dict = {}
print "Enter the word to be searched : "
#read user input
yourword = raw_input()

try:
    #This will perform a google search on our keyword
    gs = GoogleSearch(yourword)
    gs.results_per_page = 80
    #get google search results
    results = gs.get_results()
    source = ''
    #loop through all results to get each link and its content
    for res in results:
        #print res.url.encode('utf8')
        #this will give the url
        parsedurl = res.url.encode("utf8")
        myurl = urllib.urlopen(parsedurl)
        #the line above opens the url; the line below reads the content of that web page
        source = myurl.read()
        #This line will count occurrences of the entered keyword in our webpage
        count = source.count(yourword)
        #We store our result in a dictionary: for each url, its word occurrence count
        my_dict[parsedurl] = count
except SearchError, e:
    print "Search failed: %s" % e

print my_dict
#sorted_x = sorted(my_dict, key=lambda x: x[1])
for key in sorted(my_dict, key=my_dict.get, reverse=True):
    print(key, my_dict[key])
I'm extremely new to coding in general; I delved into this project in order to help my friend tag her fifteen thousand and some-odd posts on Tumblr. We've finally finished, but she wants to be sure that we haven't missed anything...
So, I've scoured the internet trying to find a coding solution. I came across a script found here that allegedly does exactly what we need, so I downloaded Python, and... it doesn't work.
More specifically, when I click on the script, a black box appears for about half a second and then disappears. I haven't been able to screenshot the box to find out exactly what it says, but I believe it says there's a syntax error. At first, I tried with Python 2.4; it didn't seem to find the json module the creator uses, so I switched to Python 3.3, the most recent version for Windows, and this is where the syntax errors occur.
#!/usr/bin/python

import urllib2
import json

hostname = "(Redacted for Privacy)"
api_key = "(Redacted for Privacy)"
url = "http://api.tumblr.com/v2/blog/" + hostname + "/posts?api_key=" + api_key

def api_response(url):
    req = urllib2.urlopen(url)
    return json.loads(req.read())

jsonresponse = api_response(url)
post_count = jsonresponse["response"]["total_posts"]
increments = (post_count + 20) / 20

for i in range(0, increments):
    jsonresponse = api_response(url + "&offset=" + str((i * 20)))
    posts = jsonresponse["response"]["posts"]
    for i in range(0, len(posts)):
        if not posts[i]["tags"]:
            print posts[i]["post_url"]

print("All finished!")
So, uhm, my question is this: if this code has a syntax error that could be fixed so it can be used to find the untagged posts on Tumblr, what might that error be?
If this code is outdated (either via Tumblr or via Python updates), then might someone with a little free time be willing to help create a new script to find Untagged posts on Tumblr? Searching Tumblr, this seems to be a semi-common problem.
In case it matters, Python is installed in C:\Python33.
Thank you for your assistance.
when I click on the script, a black box appears for about half a second and then disappears
At the very least, you should be able to run a Python script from the command line, e.g., do Exercise 0 from "Learn Python The Hard Way".
"Finding Untagged Posts on Tumblr" blog post contains Python 2 script (look at import urllib2 in the source. urllib2 is renamed to urllib.request in Python 3). It is easy to port the script to Python 3:
#!/usr/bin/env python3
"""Find untagged tumblr posts.

Python 3 port of the script from
http://www.alexwlchan.net/2013/08/untagged-tumblr-posts/
"""
import json
from itertools import count
from urllib.request import urlopen

hostname, api_key = "(Redacted for Privacy)", "(Redacted for Privacy)"
url = "https://api.tumblr.com/v2/blog/{blog}/posts?api_key={key}".format(
    blog=hostname, key=api_key)

for offset in count(step=20):
    r = json.loads(urlopen(url + "&offset=" + str(offset)).read().decode())
    posts = r["response"]["posts"]
    if not posts:  # no more posts
        break
    for post in posts:
        if not post["tags"]:  # no tags
            print(post["post_url"])
Here's the same functionality implemented using the official Python Tumblr API v2 Client (Python 2 only library):
#!/usr/bin/env python
from itertools import count

import pytumblr  # $ pip install pytumblr

hostname, api_key = "(Redacted for Privacy)", "(Redacted for Privacy)"
client = pytumblr.TumblrRestClient(api_key, host="https://api.tumblr.com")

for offset in count(step=20):
    posts = client.posts(hostname, offset=offset)["posts"]
    if not posts:  # no more posts
        break
    for post in posts:
        if not post["tags"]:  # no tags
            print(post["post_url"])
Tumblr has an API. You probably would have much better success using it.
https://code.google.com/p/python-tumblr/