Updated approach to Google search with python

I was trying to use xgoogle, but it has not been updated for 3 years, and I keep getting no more than 5 results even if I set 100 results per page. If anyone uses xgoogle without any problem, please let me know.
Now, since xgoogle is (apparently) the only available wrapper, the other option is to use some sort of browser automation, like mechanize, but that would make the code entirely dependent on Google's HTML, which they might change at any time.
The final option is to use the Custom Search API that Google offers, but it has a ridiculous limit of 100 requests per day and pricing after that.
I need help deciding which direction to go: what other options do you know of, and what works for you?
Thanks!

It only needs a minor patch.
The function GoogleSearch._extract_result (Line 237 of search.py) calls GoogleSearch._extract_description (Line 258), which fails, causing _extract_result to return None for most of the results and therefore showing fewer results than expected.
Fix:
In search.py, change Line 259 from this:
desc_div = result.find('div', {'class': re.compile(r'\bs\b')})
to this:
desc_div = result.find('span', {'class': 'st'})
I tested using:
#!/usr/bin/python
#
# This program does a Google search for "quick and dirty" and returns
# 200 results.
#

from xgoogle.search import GoogleSearch, SearchError

class give_me(object):
    def __init__(self, query, target):
        self.gs = GoogleSearch(query)
        self.gs.results_per_page = 50
        self.current = 0
        self.target = target
        self.buf_list = []

    def __iter__(self):
        return self

    def next(self):
        if self.current >= self.target:
            raise StopIteration
        else:
            if not self.buf_list:
                self.buf_list = self.gs.get_results()
            self.current += 1
            return self.buf_list.pop(0)

try:
    sites = {}
    for res in give_me("quick and dirty", 200):
        t_dict = {
            "title": res.title.encode('utf8'),
            "desc": res.desc.encode('utf8'),
            "url": res.url.encode('utf8')
        }
        sites[t_dict["url"]] = t_dict
        print t_dict
except SearchError, e:
    print "Search failed: %s" % e

I think you misunderstand what xgoogle is. xgoogle is not a wrapper; it's a library that fakes being a human user with a browser, and scrapes the results. It's heavily dependent on the format of Google's search queries and results pages as of 2009, so it's no surprise that it doesn't work the same in 2013. See the announcement blog post for more details.
You can, of course, hack up the xgoogle source and try to make it work with Google's current format (as it turns out, they've only broken xgoogle by accident, and not very badly…), but it's just going to break again.
Meanwhile, you're trying to get around Google's Terms of Service:
Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide.
They've been specifically asked about exactly what you're trying to do, and their answer is:
Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google.
And you even say that's explicitly what you want to do:
Final option is to use the Custom search API that google offers, but is has a redicolous 100 requests per day limit and a pricing after that.
So, you're looking for a way to access Google search using a method other than the interface they provide, in a deliberate attempt to get around their free usage quota without paying. They are completely within their rights to do anything they want to break your code—and if they get enough hits from people doing this kind of thing, they will do so.
(Note that when a program is scraping the results, nobody's seeing the ads, and the ads are what pay for the whole thing.)
Of course nobody's forcing you to use Google. EntireWeb has a free "unlimited" (as in "as long as you don't use too much, and we haven't specified the limit") search API. Bing gives you a higher quota, amortized by month instead of by day. Yahoo BOSS is flexible and super-cheap (and even offers a "discount server" that provides lower-quality results if it's not cheap enough), although I believe you're forced to type the ridiculous exclamation point. If none of them are good enough for you… then pay for Google.
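If you do end up going the official route, here is a minimal sketch of calling the Custom Search JSON API with requests; API_KEY and CX are placeholders for credentials you would create in Google's console, and the quota and pricing caveats above still apply:

import requests

API_KEY = "YOUR_API_KEY"           # placeholder: API key from the Google developer console
CX = "YOUR_SEARCH_ENGINE_ID"       # placeholder: Custom Search engine ID

def custom_search(query, start=1):
    # The JSON API returns at most 10 results per request; page through
    # with the 'start' parameter, subject to the free daily quota.
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": query, "start": start},
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

for item in custom_search("quick and dirty"):
    print(item["title"], item["link"])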

Related

Struggling with how to iterate data

I am learning Python 3 and have a fairly simple task to complete, but I am struggling with how to glue it all together. I need to query an API and return the full list of applications, which I can do; I store this and need to use it again to gather more data for each application from a different API call.
applistfull = requests.get(url, authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
else:
    print(applistfull.status_code)
Next I have, I think, 'summaryguid', and I need to query a different API and return a value that could exist many times for each application; in this case, the compiler used to build the code.
I can statically put a GUID in the URL and return the correct information, but I haven't yet figured out how to do the below for all of the above and build a master list:
summary = requests.get(f"url{summaryguid}moreurl", authmethod)
if summary.ok:
    fulldata = summary.json()
    for appsummary in fulldata["static-analysis"]["modules"]["module"]:
        print(appsummary["compiler"])
I would prefer that someone not just type out the right answer, but rather drop a few hints and let me continue to work through it logically, so I learn how to deal with what I assume is a common issue in the future. My thought right now is that I need to move my second if block up into my initial block and continue the logic in that space, but I am stuck there.
You are on the right track! Here is the hint: the second API request can be nested inside the loop that iterates through the list of applications in the first API call. By doing so, you can get the information you require by making the second API call for each application.
import requests

applistfull = requests.get("url", authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
        summary = requests.get(f"url/{summaryguid}/moreurl", authmethod)
        fulldata = summary.json()
        for appsummary in fulldata["static-analysis"]["modules"]["module"]:
            print(app["profile"]["name"], appsummary["compiler"])
else:
    print(applistfull.status_code)
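Since the original goal was a master list, one possible extension of the inner loop (structure is hypothetical; it assumes the same JSON shape and authmethod placeholder as above) is to accumulate into a dict keyed by GUID rather than printing:

# Hypothetical accumulator: maps each application's GUID to its list of compilers.
master = {}

for app in data["_embedded"]["applications"]:
    summaryguid = app["guid"]
    summary = requests.get(f"url/{summaryguid}/moreurl", authmethod)
    if summary.ok:
        modules = summary.json()["static-analysis"]["modules"]["module"]
        master[summaryguid] = [m["compiler"] for m in modules]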

bingads V13 report request fails in python sdk

I am trying to download a Bing Ads report using the Python SDK, but I keep getting an error saying: "Type not found: 'Aggregation'" after submitting a report request. I've tried all 4 options mentioned in the following link:
https://github.com/BingAds/BingAds-Python-SDK/blob/master/examples/v13/report_requests.py
The authentication process prior to the request works just fine.
I execute the following:
report_request = get_report_request(authorization_data.account_id)
reporting_download_parameters = ReportingDownloadParameters(
    report_request=report_request,
    result_file_directory=FILE_DIRECTORY,
    result_file_name=RESULT_FILE_NAME,
    overwrite_result_file=True,  # Set this value to True if you want to overwrite the same file.
    timeout_in_milliseconds=TIMEOUT_IN_MILLISECONDS
)
output_status_message("-----\nAwaiting download_report...")
download_report(reporting_download_parameters)
After careful debugging, it seems that the program fails when trying to execute a command within "reporting_service_manager.py". Here is the workflow:
def download_report(self, download_parameters):
    report_file_path = self.download_file(download_parameters)
then:
def download_file(self, download_parameters):
    operation = self.submit_download(download_parameters.report_request)
then:
def submit_download(self, report_request):
    self.normalize_request(report_request)
    response = self.service_client.SubmitGenerateReport(report_request)
SubmitGenerateReport starts a sequence of events ending with a call to the "_ServiceCall.__init__" function within "service_client.py", which returns the exception "Type not found: 'Aggregation'":
try:
    response = self.service_client.soap_client.service.__getattr__(self.name)(*args, **kwargs)
    return response
except Exception as ex:
    if need_to_refresh_token is False \
            and self.service_client.refresh_oauth_tokens_automatically \
            and self.service_client._is_expired_token_exception(ex):
        need_to_refresh_token = True
    else:
        raise ex
Can anyone shed some light?
Thanks
Please be sure to set Aggregation e.g., as shown here.
aggregation = 'Daily'
If the report type does not use aggregation, you can set Aggregation=None.
Does this help?
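For example, a minimal sketch of setting it when building the request with the suds factory; reporting_service is assumed to be the reporting ServiceClient set up as in the SDK examples, and the other fields shown are only illustrative:

# Sketch only: assumes reporting_service is the reporting ServiceClient from the SDK examples.
report_request = reporting_service.factory.create('CampaignPerformanceReportRequest')
report_request.Aggregation = 'Daily'  # or None for report types that don't use aggregation
report_request.Format = 'Csv'
report_request.ReportName = 'My Campaign Performance Report'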
This may be a bit late, 2 months after the fact, but maybe this will help someone else. I had the same error (though I suppose it may not be the same issue). It looks like you did what I did (and I'm sure others will as well): copy-pasted the Microsoft example code and tried to run it, only to find that it didn't work.
I spent quite some time trying to debug the issue and it looked to me like the XML wasn't being searched correctly. I was using suds-py3 for the script at the time so I tried suds-community and everything just worked after that.
I also re-read the Bing Ads API getting-started walkthrough and found that they recommend suds-jurko instead.
Long story short: If you want to use the bingads API don't use suds-py3, use either suds-community (which I can confirm works for everything I've used the API for) or suds-jurko (which is the one recommended by Microsoft).

Is it possible to inject python code in Kwargs and how could I prevent this user input

I'm currently in the middle of writing my Bachelor's thesis, and for it I am creating a database system with Postgres and Flask.
To ensure the safety of my data, I was working on a file to prevent SQL injections, since a user should be able to submit a string via an HTTP request. Since most of the functions I use to analyze the HTTP request take kwargs and a dict based on the JSON in the request, I was wondering if it is possible to inject Python code into those kwargs.
And if so, whether there are ways to prevent that.
To make it easier to understand what I mean, here are some example requests and code:
def calc_sum(a, b):
    c = a + b
    return c

@app.route('/<string:target>/<string:value>')
def handle_request(target, value):
    if target == 'calc_sum':
        cmd = json.loads(value)
        calc_sum(**cmd)
Example requests:
Normal : localhost:5000/calc_sum/{"a":1, "b":2}
Injected : localhost:5000/calc_sum/{"a":1, "b:2 ): print("ham") def new_sum(a=1, b=2):return a+b":2 }
Since I'm not near my work, where all my code is, I'm unable to test it out, and to be honest I'm not sure that my code example would even work. But I hope it conveys what I mean.
I hope you can help me, or at least nudge me in the right direction. I've searched for it, but all I can find are tutorials on "how to use kwargs".
Best regards.
Yes you can, but not in the URL path; try to use query arguments like these: localhost:5000/calc_sum?func=a+b&a=1&b=2
To get these arguments, you need to do this in Flask:
@app.route('/<string:target>')
def handle_request(target):
    if target == 'calc_sum':
        func = request.args.get('func')
        a = request.args.get('a')
        b = request.args.get('b')
        result = exec(func)
exec is used to execute Python code contained in strings.
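Whichever route you take, here is a minimal, hypothetical sketch of guarding the **kwargs unpacking itself by checking the parsed JSON against the target function's signature before calling it (ALLOWED and safe_call are names invented for this example):

import inspect
import json

def calc_sum(a, b):
    return a + b

# Hypothetical allow-list: only these targets may ever be called.
ALLOWED = {'calc_sum': calc_sum}

def safe_call(target, raw_json):
    func = ALLOWED.get(target)
    if func is None:
        raise ValueError("unknown target")
    cmd = json.loads(raw_json)  # json.loads yields plain data (str/int/...), never executable code
    allowed_params = set(inspect.signature(func).parameters)
    if not isinstance(cmd, dict) or set(cmd) - allowed_params:
        raise ValueError("unexpected arguments")
    return func(**cmd)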

python requests...or something else... mysteriously caching? hashes don't change right when file does

I have a very odd bug. I'm writing some code in Python 3 to check a URL for changes by comparing sha256 hashes. The relevant part of the code is as follows:
from copy import deepcopy  # needed for check_url below
from hashlib import sha256

from requests import get

def fetch_and_hash(url):
    file = get(url, stream=True)
    f = file.raw.read()
    hash = sha256()
    hash.update(f)
    return hash.hexdigest()

def check_url(target):  # passed a dict containing the hash from a previous examination of the url
    t = deepcopy(target)
    old_hash = t["hash"]
    new_hash = fetch_and_hash(t["url"])
    if old_hash == new_hash:
        t["justchanged"] = False
        return t
    else:
        t["justchanged"] = True
        return handle_changes(t, new_hash)  # records the changes
So I was testing this on an old webpage of mine. I ran the check, recorded the hash, and then changed the page. Then I re-ran it a few times, and the code above did not reflect a new hash (i.e., it followed the old_hash == new_hash branch).
Then I waited maybe 5 minutes and ran it again without changing the code at all except to add a couple of debugging calls to print(). And this time, it worked.
Naturally, my first thought was "huh, requests must be keeping a cache for a few seconds." But then I googled around and learned that requests doesn't cache.
Again, I changed no code except for print calls. You might not believe me. You might think "he must have changed something else." But I didn't! I can prove it! Here's the commit!
So what gives? Does anyone know why this is going on? If it matters, the webpage is hosted on a standard commercial hosting service, IIRC using Apache, and I'm on a lousy local phone company DSL connection; I don't know if there are any server-side caching settings going on, but it's not on any kind of CDN.
So I'm trying to figure out whether there is some mysterious ISP cache thing going on, or I'm somehow misusing requests... the former I can handle; the latter I need to fix!
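One way to rule out an intermediary cache while debugging this (a sketch only; whether any cache in the path honours the header is up to that cache) is to send no-cache request headers and hash the decoded body:

from hashlib import sha256
from requests import get

def fetch_and_hash_nocache(url):
    # Ask any caches along the way to revalidate rather than serve a stored copy.
    resp = get(url, headers={"Cache-Control": "no-cache", "Pragma": "no-cache"})
    h = sha256()
    h.update(resp.content)  # resp.content is the decoded response body as bytes
    return h.hexdigest()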

Why does the TheyWorkForYou (TWFY) web API always return '{}'

I'm calling a web API exposed by TheyWorkForYou (TWFY).
http://www.theyworkforyou.com/api/
I'm using the Python bindings provided by twfython:
http://code.google.com/p/twfython/
I wrote some code to call this API a few months ago, at which time it worked fine. But now that I've dug it out to run it again, no matter what query I send to the API, it always returns '{}' (an empty dictionary). For example, the following code should return a list of all MPs:
from twfy import TWFY
API_KEY = 'XXXXXXXXXXXXXXXXXXXXXX'
twfy = TWFY.TWFY(API_KEY)
print twfy.api.getMPs(output='js')
Am I being really dumb? What else should I check?
You can run the getMPs call on their website directly, and it also produces no output. So you're probably right about there actually being no MPs at the moment.
Do you get the same output if you call getMSPs? This one seems like it should return data.
From the horse's mouth, Matthew Somerville at ORG:
The API is working as documented - when there is no MP (i.e. everywhere between dissolution and election), getMP will return no MP unless you specify the always_return parameter (which is why that parameter exists). This has always been the case after e.g. the death of an MP, or the resignation of Iris Robinson.
Also, getMPs (note the 's') will not return any MPs for a date for which there are no MPs - so you should specify the dissolution date if you want the list of MPs as of that date (and sorry, there's not an always_return option there).
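If the twfython binding forwards keyword arguments straight through to the underlying HTTP API (an assumption worth checking in its source), asking for the list as it stood on a date before dissolution would look something like this; the date below is only illustrative:

from twfy import TWFY

API_KEY = 'XXXXXXXXXXXXXXXXXXXXXX'
twfy = TWFY.TWFY(API_KEY)
# Assumes the binding passes 'date' through to getMPs; the date here is illustrative.
print twfy.api.getMPs(date='2010-04-01', output='js')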
