Something faster than urllib2.urlopen()? - python

I'm building a scraper that gathers data from a site. Sometimes it has to go through more than 10,000 pages, and opening each one with urllib2.urlopen() takes time. I'm not very hopeful about this, but does anyone know of a faster way to get HTML from a site?
My code is this:
import urllib, json, time
import requests
##########################
start_time = time.time()
##########################
query = "hill"
queryEncode = urllib.quote(query)
url = 'https://www.googleapis.com/customsearch/v1?key={{MY API KEY}}&cx={{cxKey}}:omuauf_lfve&fields=queries(request(totalResults))&q='+queryEncode
response = urllib.urlopen(url)
data = json.loads(str(response.read()))
##########################
elapsed_time = time.time() - start_time
print " url to json time : " + str(elapsed_time)
##########################
And the output is
url to json time : 4.46600008011
[Finished in 4.7s]
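One common speedup is to reuse a single HTTP connection and fan the requests out over a thread pool; note the script above already imports requests but never uses it. A minimal sketch along those lines, where the URL list is a stand-in for the real pages:

import requests
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of pages; substitute the real >10,000 URLs here
urls = ['https://example.com/page/%d' % i for i in range(100)]

session = requests.Session()  # keep-alive: reuses TCP connections across requests

def fetch(url):
    return session.get(url, timeout=10).text

with ThreadPoolExecutor(max_workers=20) as executor:
    for html in executor.map(fetch, urls):
        pass  # parse each page here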

Related

How can I use multithreading with requests?

Hello, I have this Python code, which uses the requests module:
import requests

url1 = "myurl1"  # I do not remember exactly the exact url
response1 = requests.get(url1)
temperature1 = response1.json()["temperature"]

url2 = "myurl2"  # I do not remember exactly the exact url
response2 = requests.get(url2)
temperature2 = response2.json()["temp"]

url3 = "myurl3"  # I do not remember exactly the exact url
response3 = requests.get(url3)
temperature3 = response3.json()[0]

print(temperature1)
print(temperature2)
print(temperature3)
And actually, I have to tell you, this is a little bit slow... Do you have a solution to improve the speed of my code? I thought of using multithreading, but I don't know how to use it...
Thank you very much!
Try Python executors:
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from multiprocessing import cpu_count

urls = ['/url1', '/url2', '/url3']

with ThreadPoolExecutor(max_workers=2*cpu_count()) as executor:
    future_to_url = {executor.submit(requests.get, url): url for url in urls}
    for future in as_completed(future_to_url):
        response = future.result()  # TODO: handle exceptions here
        url = future_to_url[future]
        # TODO: do something with that data
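Applied to the three temperature URLs from the question above, a minimal sketch could look like this (the URLs and JSON keys are the placeholders from the question):

import requests
from concurrent.futures import ThreadPoolExecutor

urls = ["myurl1", "myurl2", "myurl3"]  # placeholders from the question

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() preserves order, so each response lines up with its URL
    responses = list(executor.map(requests.get, urls))

# Each API exposes the temperature under a different key, as in the question
temperature1 = responses[0].json()["temperature"]
temperature2 = responses[1].json()["temp"]
temperature3 = responses[2].json()[0]
print(temperature1, temperature2, temperature3)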

Print Updated Variable using Python Threading

I'm trying to print an updated value and store it in a CSV file. I'm using threading, and the print happens every second; however, after every second that elapses, the same value is printed. Can someone help?
import urllib.request, urllib.parse, urllib.error
import json
import threading
import time

localtime = time.asctime(time.localtime(time.time()))
url = 'api'
uh = urllib.request.urlopen(url)
data = uh.read().decode()
js = json.loads(data)

def last_price():
    threading.Timer(1.0, last_price).start()
    print(js['last'])
    print(localtime)

last_price()
The variable js is currently evaluated only once. If you want to query the API every second, move the query code inside the function being executed by the timer:
url = 'api'

def last_price():
    localtime = time.asctime(time.localtime(time.time()))
    uh = urllib.request.urlopen(url)
    data = uh.read().decode()
    js = json.loads(data)
    print(js['last'])
    print(localtime)
    threading.Timer(1.0, last_price).start()

last_price()
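The question also mentions storing the value in a CSV file; here is a minimal sketch of that, one row appended per tick (the 'api' URL and 'last' key are the placeholders from the question, and prices.csv is a hypothetical file name):

import csv
import json
import threading
import time
import urllib.request

url = 'api'  # placeholder from the question

def last_price():
    localtime = time.asctime(time.localtime(time.time()))
    js = json.loads(urllib.request.urlopen(url).read().decode())
    # Append one timestamped row per second; 'prices.csv' is hypothetical
    with open('prices.csv', 'a', newline='') as f:
        csv.writer(f).writerow([localtime, js['last']])
    threading.Timer(1.0, last_price).start()

last_price()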

Wunderground api search bar

I'm currently writing a program that will search for the weather.
I'm trying to create an option where you can search for your location; however, it doesn't seem to be working.
from urllib.request import Request, urlopen
import json
import re
location = input('location you would like to know the weather for')
API_KEY = '<API-KEY>'
url = 'http://python-weather-api.googlecode.com/svn/trunk/ python-weather-api' + API_KEY +'/geolookup/conditions/q/IA/'+ location +'.json'
response = urllib.request.Request(url)
json_string = response.read().decode('utf8')
parsed_json = json.loads(json_string)
location = parsed_json['location']['city']
temp_f = parsed_json['current_observation']['temp_f']
print("Current temperature in %s is: %s" % (location, temp_f))
response.close()
I keep receiving an "urllib is not defined" error.
Can't comment yet (reputation too low since I just joined SO), but regarding your "urllib is not defined" issue, that has to do with how you import the urlopen function.
Instead of:
urllib.urlopen(url)
try:
urlopen(url)
EDIT: Here's your code, fixed:
from urllib.request import urlopen
import json
location = input('location you would like to know the weather for')
API_KEY = '<API-KEY>'
url = 'http://api.wunderground.com/api/' + API_KEY + '/geolookup/conditions/q/IA/'+ str(location) +'.json'
response = urlopen(url)
json_string = response.read().decode('utf8')
parsed_json = json.loads(json_string)
location = parsed_json['location']['city']
temp_f = parsed_json['current_observation']['temp_f']
print("Current temperature in %s is: %s" % (location, temp_f))
Works fine for Tama and other cities in IA. Watch out though: place names like Des Moines won't work, because spaces aren't allowed in URLs; you'll have to take care of that. (The example for the API suggests _ for spaces: http://www.wunderground.com/weather/api/d/docs?MR=1.) Good luck!
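One way to take care of those spaces is to percent-encode the user's input before building the URL; a minimal sketch using urllib.parse.quote (replacing spaces with _, as the docs suggest, would also work):

from urllib.parse import quote

API_KEY = '<API-KEY>'
location = input('location you would like to know the weather for')
# Percent-encode spaces and other unsafe characters for the URL path
url = ('http://api.wunderground.com/api/' + API_KEY +
       '/geolookup/conditions/q/IA/' + quote(location) + '.json')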
Hey, not sure if you are still stuck on this, but here is my Wunderground project, which should have what you're looking for: https://github.com/Oso9817/weather_texts
Where you have str.input, you need to use str(location). If you go into a Python REPL and type str, you will see that it is the built-in string type, not a variable of yours. You want to use the location variable you got from the user's input, not the str object.

http head request slows down script (python requests)

I'm working on a Python script that iterates over a list of URLs pointing to .mp3 files; the aim is to extract the Content-Length from each URL through a HEAD request, using the requests library.
However, I noticed that the HEAD requests slow the script down significantly; isolating the code involved, I measure an execution time of about 1.5 minutes for 200 URLs/requests:
import requests
import time

print("start\n\n")
t1 = time.time()

for n in range(200):
    response = requests.head("url.mp3")
    print(response, "\n")

t2 = time.time()
print("\n\nend\n\n")
print("time: ", t2 - t1, "s")
A good solution for you could be grequests:
import grequests

rs = (grequests.get('http://127.0.0.1/%i.mp3' % i) for i in range(200))
for response in grequests.map(rs):
    print('Status code %i' % response.status_code)
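If installing grequests (and its gevent dependency) is not an option, note that much of the per-request cost here is TCP/TLS connection setup; reusing one connection with requests.Session already helps a lot. A minimal sketch, keeping the placeholder URL from the question:

import requests
import time

t1 = time.time()
session = requests.Session()  # keep-alive: one handshake, many requests

for n in range(200):
    response = session.head("url.mp3")  # placeholder URL from the question

print("time:", time.time() - t1, "s")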

Python - Twisted : POST in a form

Hi guys!
I'm still discovering Twisted, and I've made this script to parse the content of an HTML table into Excel. This script works well! My question is: how can I do the same for a single webpage (http://bandscore.ielts.org/), but with a lot of POST requests, so that I can fetch all the results, parse them with BeautifulSoup, and then put them into Excel?
Parsing the source and putting it into Excel is OK, but I don't know how to make a POST request with Twisted in order to implement that in my script.
This is the script I use for parsing (with Twisted) a lot of different pages
(I want to be able to write the same script, but with a lot of different POST payloads against the same page rather than a lot of pages):
from twisted.web import client
from twisted.internet import reactor, defer
from bs4 import BeautifulSoup as BeautifulSoup
import time
import xlwt

start = time.time()
wb = xlwt.Workbook(encoding='utf-8')
ws = wb.add_sheet("BULATS_IA_PARSED")

global x
x = 0

Countries_List = ['Afghanistan','Armenia','Brazil','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam']
urls = ["http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % Countries for Countries in Countries_List]

def finish(results):
    global x
    for result in results:
        print 'GOT PAGE', len(result), 'bytes'
        soup = BeautifulSoup(result)
        tableau = soup.findAll('table')
        try:
            rows = tableau[3].findAll('tr')
            print("Fetching")
            for tr in rows:
                cols = tr.findAll('td')
                y = 0
                x = x + 1
                for td in cols:
                    texte_bu = td.text
                    texte_bu = texte_bu.encode('utf-8')
                    #print("Writing...")
                    #print texte_bu
                    ws.write(x, y, td.text)
                    y = y + 1
        except IndexError:
            print("No IA for this country")
            pass
    reactor.stop()

waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish)
reactor.run()

wb.save("IALOL.xls")
print "Elapsed Time: %s" % (time.time() - start)
Thank you very much in advance for your help!
You have two options: keep using getPage and tell it to POST instead of GET, or use Agent.
The API documentation for getPage directs you to the API documentation for HTTPClientFactory to discover additional supported options.
The latter API documentation explicitly covers method and implies (but does a bad job of explaining) postdata. So, to make a POST with getPage:
d = getPage(url, method='POST', postdata="hello, world, or whatever.")
There is a howto-style document for Agent (linked from the overall web howto documentation index). It gives examples of sending a request with a body (i.e., see the FileBodyProducer example).
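Putting that together for http://bandscore.ielts.org/, here is a minimal sketch of many POSTs against one page with getPage; the form field names are hypothetical and must be read from the page's own <form>:

from urllib import urlencode
from twisted.web import client
from twisted.internet import reactor, defer

# 'country' is a hypothetical field name; inspect the search <form> for the real ones
payloads = [urlencode({'country': c}) for c in ['France', 'Germany']]

def finish(results):
    for html in results:
        print 'GOT PAGE', len(html), 'bytes'  # parse with BeautifulSoup here
    reactor.stop()

waiting = [client.getPage('http://bandscore.ielts.org/',
                          method='POST',
                          postdata=body,
                          headers={'Content-Type': 'application/x-www-form-urlencoded'})
           for body in payloads]
defer.gatherResults(waiting).addCallback(finish)
reactor.run()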
