I'm trying to print an updated value and store it in a CSV file. I'm using threading, and the print happens every second; however, after every second that elapses, the same value is printed. Can someone help?
import urllib.request, urllib.parse, urllib.error
import json
import threading
import time

localtime = time.asctime(time.localtime(time.time()))

url = 'api'
uh = urllib.request.urlopen(url)
data = uh.read().decode()
js = json.loads(data)

def last_price():
    threading.Timer(1.0, last_price).start()
    print(js['last'])
    print(localtime)

last_price()
The variable js is currently evaluated only once. If you want to query the API every second, move the query code inside the function being executed by the timer:
url = 'api'

def last_price():
    localtime = time.asctime(time.localtime(time.time()))
    uh = urllib.request.urlopen(url)
    data = uh.read().decode()
    js = json.loads(data)
    print(js['last'])
    print(localtime)
    threading.Timer(1.0, last_price).start()

last_price()
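Since the question also mentions storing each value in a CSV file, here is a minimal sketch of doing that inside the same function. It builds on the snippet above (same url, imports, and timer); the file name 'prices.csv' and the column layout are my assumptions, not part of the original code:

import csv

def last_price():
    localtime = time.asctime(time.localtime(time.time()))
    uh = urllib.request.urlopen(url)
    js = json.loads(uh.read().decode())
    # append one row per reading (assumed layout: timestamp, last price)
    with open('prices.csv', 'a', newline='') as f:
        csv.writer(f).writerow([localtime, js['last']])
    print(js['last'])
    print(localtime)
    threading.Timer(1.0, last_price).start()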
I am trying to read a table on a website. The first (initial) read is correct, but the subsequent requests in the loop are out of date (the information doesn't change even though the website changes). Any suggestions?
The link shown in the code is not the actual website that I am looking at. Also, I am going through a proxy server.
I don't get an error, just out of date information.
Here is my code:
import time
import urllib.request
from pprint import pprint
from html_table_parser.parser import HTMLTableParser
import pandas as pd

def url_get_contents(url):
    # making a request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()

link = 'https://www.w3schools.com/html/html_tables.asp'
xhtml = url_get_contents(link).decode('utf-8')
p = HTMLTableParser()
p.feed(xhtml)
stored_page = p.tables[0]

while True:
    try:
        xhtml = url_get_contents(link).decode('utf-8')
        p = HTMLTableParser()
        p.feed(xhtml)
        print('now: ', p.tables[0])
        time.sleep(120)
        continue
    # to handle exceptions
    except Exception as e:
        print("error")
First of all, thank you for taking the time to read through this post. I'd like to begin by saying that I'm very new to programming in general and that I'm seeking advice to solve a problem.
I'm trying to create a script that checks whether the content of an HTML page has changed. I'm doing this to monitor certain website pages for changes. I found a script and made some alterations so that it goes through a list of URLs, checking whether each page has changed. The problem is that it checks the pages sequentially: it goes through the list one URL at a time, while I want the script to check the URLs in parallel. I'm also using a while loop to keep checking the pages, because even after a change takes place it still has to keep monitoring the page. I could write a thousand more words explaining what I'm trying to do, so instead have a look at the code:
import requests
import time
import smtplib
from email.message import EmailMessage
import hashlib
from urllib.request import urlopen

url = ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]
i = 0

response = urlopen(url[i]).read()
currentHash = hashlib.sha224(response).hexdigest()

while True:
    try:
        response = urlopen(url[i]).read()
        currentHash = hashlib.sha224(response).hexdigest()
        print('checking')
        time.sleep(10)

        response = urlopen(url[i]).read()
        newHash = hashlib.sha224(response).hexdigest()
        i += 1

        if newHash == currentHash:
            continue
        else:
            print('Change detected')
            print(url[i])
            time.sleep(10)
            continue
    except Exception as e:
        i = 0
        print('resetting increment')
        continue
What you want to do is called multi-threading.
Conceptually this is how it works:
import hashlib
import time
from urllib.request import urlopen
import threading

# Define a function for the thread
def f(url):
    initialHash = None
    while True:
        response = urlopen(url).read()
        currentHash = hashlib.sha224(response).hexdigest()
        if not initialHash:
            initialHash = currentHash

        if currentHash != initialHash:
            print('Change detected')
            print(url)
            # remember the new content so later changes are detected too
            initialHash = currentHash

        time.sleep(10)

# Create one thread per URL
for url in ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]:
    t = threading.Thread(target=f, args=(url,))
    t.start()
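A small follow-up on the sketch above (my addition, not part of the original answer): keeping references to the threads and marking them as daemons makes the script easier to stop, because non-daemon workers stuck in an endless loop would otherwise keep the process alive after Ctrl+C. This reuses f and threading from the code above:

threads = []
for url in ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]:
    # daemon threads die together with the main thread
    t = threading.Thread(target=f, args=(url,), daemon=True)
    t.start()
    threads.append(t)

# block the main thread; a KeyboardInterrupt here ends the whole program
for t in threads:
    t.join()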
Running example of the OP's code using a thread pool executor
Code
import concurrent.futures
import time
import hashlib
from urllib.request import urlopen

def check_change(url):
    '''
    Checks for a change in web page contents by comparing the current hash to the previous one
    '''
    try:
        response = urlopen(url).read()
        currentHash = hashlib.sha224(response).hexdigest()
        time.sleep(10)
        response = urlopen(url).read()
        newHash = hashlib.sha224(response).hexdigest()
        if newHash != currentHash:
            return "Change to:", url
        else:
            return None
    except Exception as e:
        return "Error", e, url

page_urls = ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]

while True:
    # We can use a thread pool executor to ensure the threads are cleaned up properly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(check_change, url): url for url in page_urls}
        for future in concurrent.futures.as_completed(future_to_url):
            # Output the result of each thread upon its completion
            url = future_to_url[future]
            try:
                status = future.result()
                if status:
                    print(*status)
                else:
                    print(f'No change to: {url}')
            except Exception as exc:
                print('Site %r generated an exception: %s' % (url, exc))
    time.sleep(10)  # Wait 10 seconds before rechecking sites
Output
Change to: https://www.google.com
Change to: https://www.google.be
Change to: https://www.youtube.be
Change to: https://www.google.be
Change to: https://www.google.com
Change to: https://www.youtube.be
Change to: https://www.google.be
Change to: https://www.google.com
...
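One design note on the example above (my suggestion, not part of the answer): the pool is rebuilt on every pass through the while loop; creating it once and reusing it across polling rounds is slightly cheaper. This variation reuses check_change and page_urls from the example:

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    while True:
        future_to_url = {executor.submit(check_change, url): url for url in page_urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            status = future.result()
            if status:
                print(*status)
            else:
                print(f'No change to: {url}')
        time.sleep(10)  # wait before the next round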
I want to log, for a start, how many successful requests with status 200 I get after I complete the web scraping of a page. I use the following code:
import requests
import csv
import selenium
from selenium import webdriver
import time
from time import sleep
import datetime

mycount = 0

class Parser(object):
    ses = requests.Session()

    # parse a single item to get information
    def parse(self, urls):
        url = urls[1]
        try:
            r = self.ses.get(url)
            time.sleep(3)
            if r.status_code == 200:
                mycount = mycount + 1
and later on, when I have mycount, I pass it to a list and a CSV:
if __name__ == "__main__":
    with Pool(4) as p:
        print('Just before parsing..Page')
        records = p.map(parser.parse, web_links)
    with open(my_log_path, 'a', encoding='utf-8', newline='') as logf:
        writer = csv.writer(logf, delimiter=';')
        writer.writerow(logs)
But I get an error saying that my local variable is referenced before assignment.
Why is mycount treated as a local variable if it is defined at the top, outside any function? How can I fix this?
Thank you.
Your method can read mycount, but because you assign to it, Python treats it as a local variable inside parse. Declare it global inside the method before modifying it:
def parse(self, urls):
    global mycount
    url = urls[1]
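For completeness, a minimal sketch of the method with the fix applied; the surrounding class, session, and imports are taken from the question, and the bare except handling is my simplification:

mycount = 0

class Parser(object):
    ses = requests.Session()

    def parse(self, urls):
        global mycount
        url = urls[1]
        try:
            r = self.ses.get(url)
            time.sleep(3)
            if r.status_code == 200:
                mycount = mycount + 1  # now updates the module-level counter
        except requests.RequestException:
            pass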
I'm building a thing that gathers data from a site. Sometimes it has to go through >10,000 pages, and opening each one with urllib2.urlopen() takes time. I'm not very hopeful about this, but does anyone know of a faster way to get html from a site?
My code is this:
import urllib, json, time
import requests
##########################
start_time = time.time()
##########################
query = "hill"
queryEncode = urllib.quote(query)
url = 'https://www.googleapis.com/customsearch/v1?key={{MY API KEY}}&cx={{cxKey}}:omuauf_lfve&fields=queries(request(totalResults))&q='+queryEncode
response = urllib.urlopen(url)
data = json.loads(str(response.read()))
##########################
elapsed_time = time.time() - start_time
print " url to json time : " + str(elapsed_time)
##########################
And the output is
url to json time : 4.46600008011
[Finished in 4.7s]
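One common way to speed this up, sketched below as a suggestion rather than a drop-in fix (the URL list and worker count are illustrative), is to issue many requests concurrently from a thread pool so the waiting on each response overlaps:

import concurrent.futures
import requests

def fetch(url):
    # one GET per URL; return the exception instead of raising so one bad page
    # does not stop the whole batch
    try:
        return url, requests.get(url, timeout=10).text
    except requests.RequestException as exc:
        return url, exc

urls = ['https://www.example.com/page/%d' % i for i in range(100)]  # illustrative

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for url, result in executor.map(fetch, urls):
        print(url, 'failed' if isinstance(result, Exception) else 'ok')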
This is a program that takes multiple URLs as input, called via the URL localhost:8888/api/v1/crawler.
The program takes over an hour to run, which is fine in itself, but it blocks the other APIs: while it is running, no other API will respond until the existing call ends. I want to run this program asynchronously; how can I achieve that with the same program?
@tornado.web.asynchronous
@gen.coroutine
@use_args(OrgTypeSchema)
def post(self, args):
    print "Enter In Crawler Match Script POST"
    print "Argsssss........"
    print args
    data = tornado.escape.json_decode(self.request.body)
    print "Data................"
    import json
    print json.dumps(data.get('urls'))
    from urllib import urlopen
    from bs4 import BeautifulSoup
    try:
        urls = json.dumps(data.get('urls'))
        urls = urls.split()
        import sys
        list = []
        # orig_stdout = sys.stdout
        # f = open('out.txt', 'w')
        # sys.stdout = f
        for url in urls:
            # print "FOFOFOFOFFOFO"
            # print url
            url = url.replace('"', " ")
            url = url.replace('[', " ")
            url = url.replace(']', " ")
            url = url.replace(',', " ")
            print "Final Url "
            print url
            try:
                site = urlopen(url) ..............
Your post method is 100% synchronous. You should make the site = urlopen(url) call asynchronous. Tornado has an async HTTP client for that; there is also a good example here.
You are using urllib, which is the reason for the blocking.
Tornado provides a non-blocking client called AsyncHTTPClient, which is what you should be using.
Use it like this:
from tornado.httpclient import AsyncHTTPClient

@gen.coroutine
@use_args(OrgTypeSchema)
def post(self, args):
    ...
    http_client = AsyncHTTPClient()
    site = yield http_client.fetch(url)
    ...
Another thing I'd like to point out: don't import modules from inside a function. It's not the reason for the blocking, but it is still slower than putting all your imports at the top of the file. Read this question.
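To make the shape of the change a bit more concrete, here is a rough, self-contained sketch of a handler that fetches each posted URL without blocking the IOLoop; the class name, the response layout, and the omission of the use_args decorator are my simplifications, not part of the original answer:

import tornado.escape
import tornado.web
from tornado import gen
from tornado.httpclient import AsyncHTTPClient

class CrawlerHandler(tornado.web.RequestHandler):
    @gen.coroutine
    def post(self):
        data = tornado.escape.json_decode(self.request.body)
        http_client = AsyncHTTPClient()
        results = []
        for url in data.get('urls', []):
            # yield suspends this handler while the fetch is in flight,
            # so other requests keep being served in the meantime
            response = yield http_client.fetch(url, raise_error=False)
            results.append({'url': url, 'code': response.code})
        self.write({'results': results})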