Python requests acting like it has a cache

I am trying to get a color hexcode from an XML page on my website and update a script within 5-10 seconds. I am able to read the hexcode just fine, and I am able to change the value in the XML file just fine, but the script takes a while to reflect the update.
I want the script to update every 5 seconds by checking the XML file on my webserver, but it takes about a full minute before the code actually sees the update. Is my Python script somehow caching the XML file? Is my webserver possibly sending a cached version? (Viewing the XML file in Chrome reflects the change instantly, though.)
Python code:
import time
import serial
import requests
from bs4 import BeautifulSoup

ser = serial.Serial('/dev/ttyACM0', 9600)
print('Connected to Arduino!')

while True:
    print('Connecting to website...')
    page = requests.get('http://xanderluciano.com/pi/color.xml', timeout=5)
    soup = BeautifulSoup(page.text, 'html.parser')
    print('scraped hexcode: ' + soup.color.string)
    hex = soup.color.string
    ser.write(hex.encode('utf-8'))
    print(ser.readline())
    time.sleep(5)
XML File:
<?xml version="1.0" encoding="UTF-8"?>
<ledstrip>
    <color>2196f3</color>
    <flash>false</flash>
    <freq>15</freq>
</ledstrip>

The solution was that my webserver used NGINX as a server-side cache. I opted to disable this caching during development so that I could see the results instantly. Most likely there is a better way of pushing data than continually polling the webserver for it.
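If disabling the server-side cache is not possible, a client-side workaround (just a sketch, not what I ended up doing) is to send no-cache headers and append a changing query parameter so each poll looks like a fresh URL to any cache in between. Whether NGINX honours the headers depends on how it is configured:

import time
import requests

URL = 'http://xanderluciano.com/pi/color.xml'

# Ask intermediate caches not to serve a stored copy; honouring these
# headers depends on the proxy/cache configuration.
NO_CACHE_HEADERS = {'Cache-Control': 'no-cache', 'Pragma': 'no-cache'}

while True:
    # A changing query parameter ("cache busting") makes every request a
    # distinct URL, so most caches treat it as a miss.
    page = requests.get(URL, headers=NO_CACHE_HEADERS,
                        params={'_': str(int(time.time()))}, timeout=5)
    print(page.text)
    time.sleep(5)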

Related

How can I keep a Python app running 24/7 on a host?

I wrote a mini-app that scrapes my school's website, looks for the title of the latest post, and compares it to the old title; if it's not the same, it sends me an email.
In order for the app to work properly, it needs to keep running 24/7 so that the value of the title variable stays correct.
Here's the code:
import requests
from bs4 import BeautifulSoup
import schedule, time
import sys
import smtplib

# Mailing info
from_addr = ''
to_addrs = ['']
# Note: the blank line after the headers separates them from the message body.
message = """From: sender
To: receiver
Subject: New Post

A new post has been published
visit the website to view it:
"""

def send_mail(msg):
    try:
        s = smtplib.SMTP('localhost')
        s.login('email',
                'password')
        s.sendmail(from_addr, to_addrs, msg)
        s.quit()
    except smtplib.SMTPException as e:
        print(e)

# Scraping
URL = ''
title = 'Hello World'

def check():
    global title
    global message
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    main_section = soup.find('section', id='spacious_featured_posts_widget-2')
    first_div = main_section.find('div', class_='tg-one-half')
    current_title = first_div.find('h2', class_='entry-title').find('a')['title']
    if current_title != title:
        send_mail(message)
        title = current_title
    else:
        send_mail("Nothing New")

schedule.every(6).hours.do(check)

while True:
    schedule.run_pending()
    time.sleep(0.000001)
So my question is: how do I keep this code running on the host using cPanel?
I know I can use cron jobs to run it every two hours or so, but I don't know how to keep the script itself running; using a terminal doesn't work, because when I close the page the app gets terminated.
So, generally, to run programs for an extended period they need to be daemonised: essentially disconnected from your terminal with a double-fork and a setsid. Having said that, I've never actually done it myself, since it was usually either (a) the wrong solution, or (b) re-inventing the wheel (https://github.com/thesharp/daemonize).
In this case, I think a better course of action would be to invoke the script every 6 hours, rather than have it internally do something every 6 hours. Making your program resilient to restarts, and putting it in a 'cradle' that automatically restarts it, is pretty much how most systems are kept reliable.
In your case, I'd suggest saving the title to a file, and reading from and writing to that file each time the script is invoked. It would make your script simpler and more robust, and you'd be using battle-hardened tools for the job.
A couple of years down the line, when you're writing code that needs to survive the whole machine crashing and being replaced (within 6 hours, with everything installed), you can use some external form of storage (like a database) instead of a file, to make your system even more resilient.
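A minimal sketch of that file-based approach, reusing the same page structure as in your question (the state-file name is just illustrative). The script does one check per run and exits, so a cron entry like 0 */6 * * * /usr/bin/python3 /path/to/check_post.py (which cPanel's cron UI can create) runs it every six hours:

import os
import requests
from bs4 import BeautifulSoup

URL = ''                       # the page to scrape
STATE_FILE = 'last_title.txt'  # where the last seen title is stored

def read_last_title():
    # Return the previously saved title, or an empty string on the first run.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return f.read().strip()
    return ''

def save_title(title):
    with open(STATE_FILE, 'w') as f:
        f.write(title)

def check_once():
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    main_section = soup.find('section', id='spacious_featured_posts_widget-2')
    first_div = main_section.find('div', class_='tg-one-half')
    current_title = first_div.find('h2', class_='entry-title').find('a')['title']
    if current_title != read_last_title():
        save_title(current_title)
        # send_mail(message)  # reuse the send_mail() from your question here
        print('New post:', current_title)

if __name__ == '__main__':
    check_once()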

Unable to load torrent file using raw base64 form and python xmlrpc client

I am trying to load a torrent file into rtorrent using XML-RPC with the following Python 3 code:
import xmlrpc.client

server_url = "https://%s:%s@%s/xmlrpc" % ('[REDACTED]', '[REDACTED]', '[REDACTED]')
server = xmlrpc.client.Server(server_url)

with open("test.torrent", "rb") as torrent:
    server.load.raw_verbose(xmlrpc.client.Binary(torrent.read()),
                            "d.delete_tied=", "d.custom1.set=Test",
                            "d.directory.set=/home/[REDACTED]/files")
The load.raw_verbose command returns without an error (return code 0), but the torrent does not appear in the ruTorrent UI. I seem to be experiencing the same thing as in this Reddit post, but I am using the Binary class without any luck.
I am using a Whatbox seedbox.
EDIT:
After enabling logging I am seeing
1572765194 E Could not create download, the input is not a valid torrent.
when trying to load the torrent file; however, manually loading the torrent file through the ruTorrent UI works fine.
I needed to add "" as the first argument:
server.load.raw_verbose("",xmlrpc.client.Binary(torrent.read()),"d.delete_tied=","d.custom1.set=Test","d.directory.set=/home/[REDACTED]/files")
Not sure why; the docs don't seem to show that this is needed. (It appears that newer rtorrent builds expect a "target" as the first argument to XML-RPC commands, with an empty string meaning the command is not tied to a specific download.)

Python - How to search for a zip file that resides in an iframe on https

Python - 2.7.5
Google Chrome
First off, I am a self-taught coder and will accept any critique and/or suggestions on the code posted below. This issue has been a joy to work through because I love challenging myself, but I am afraid I have hit a brick wall and need some guidance. I will be as detailed as possible to fully explain the overall picture of my script and then show where I am with the actual issue described in the title.
I am putting together a script that will go out and download data automatically, unzip it, and export it to a GDB. We serve a wide region of users and have a very large enterprise SDE setup containing a large amount of public data that we have to search and update for our end users. Most of our data is updated monthly by local government entities, and we have to go out and find it manually, download, unzip, QAQC, etc. I want to put together a script that automates the first part of this process by downloading all my data for me and exporting it to a local GDB; from there I can QAQC everything and upload it to our SDE for our users to access.
The process has been pretty straightforward so far, until the issue I have before me. My script searches a webpage for specific keywords, finds the relevant link, and begins the download. For this post I will use two examples: one that works and one that is currently giving me issues. What works is my function for searching and downloading the Metro GIS dataset, and the code below shows my current process for finding it. So far all the http websites I have included use the posted function. As with Metro, I plan on having a defined function for each group of data.
import requests, zipfile, StringIO, time, arcpy, urllib2, urlparse
from BeautifulSoup import BeautifulSoup

arcpy.env.overwriteOutput = True

workPath = --  # The output GDB
timestr = time.strftime("%Y%m%d")
gdbName = "GlobalSDEUpdate_" + timestr
gdbPath = workPath + "\\" + gdbName + ".gdb"

class global_DataFinder(object):
    def __init__(self):
        object.__init__(self)
        self.gdbSetup()
        self.metro()

    def gdbSetup(self):
        arcpy.CreateFileGDB_management(workPath, gdbName)

    def fileDownload(self, key, url, dlPath, dsName):
        page = urllib2.urlopen(url).read()
        urlList = []
        soup = BeautifulSoup(page)
        soup.prettify()
        for link in soup.findAll('a', href=True):
            if 'http://' not in link['href']:
                if urlparse.urljoin(url, link['href']) not in urlList:
                    zipDL = urlparse.urljoin(url, link['href'])
                    if zipDL.endswith(".zip"):
                        if key in zipDL:
                            urlList.append(zipDL)
        for x in urlList:
            print x
            r = requests.get(x, stream=True)
            z = zipfile.ZipFile(StringIO.StringIO(r.content))
            z.extractall(dlPath)
        arcpy.CreateFeatureDataset_management(gdbPath, dsName)
        arcpy.env.workspace = dlPath
        shpList = []
        for shp in arcpy.ListFeatureClasses():
            shpList.append(shp)
        arcpy.FeatureClassToGeodatabase_conversion(shpList, (gdbPath + "\\" + dsName))
        del shpList[:]

    def metro(self):
        key = "METRO_GIS_Data_Layers"
        url = "http://www.ridemetro.org/Pages/NewsDownloads.aspx"
        dlPath = --  # Where my zipfiles output to
        dsName = "Metro"
        self.fileDownload(key, url, dlPath, dsName)

global_DataFinder()
As you can see above, this is the method I started with, using Metro as my first test, and it is currently working great. I was hoping all my sites going forward would work like this, but when I got to FEMA I ran into an issue.
The National Flood Hazard Layer (NFHL) Status website hosts floodplain data for many counties across the country, available for free to anyone who wishes to use it. When arriving at the website you can search for the county you want, the table queries out the search, and then you can simply click and download the county you desire. When checking the source, I noticed the table is in an iframe.
When accessing the iframe source link through Chrome and checking the png source URL, this is what you get: https://hazards.fema.gov/femaportal/NFHL/searchResult
Now here is where my problem lies: unlike the http sites, I have quickly learned that accessing a secured https site and scraping the page is different, especially when it uses JavaScript to show the table. I have spent hours searching through forums and have tried different Python packages like selenium, mechanize, requests, urllib, and urllib2, and I always seem to hit a dead end before I can establish a connection, parse the webpage, and search for my county's zipfile. The code below shows the closest I have gotten, along with the error I am getting.
(I always test in a separate script and then, when it works, I bring it over to my main script, so that's why the code snippet below is separated from the original.)
import urllib2, httplib, socket, ssl
from BeautifulSoup import BeautifulSoup

url = "http://www.floodmaps.fema.gov/NFHL/status.shtml"

def test():
    page = urllib2.urlopen(url).read()
    urlList = []
    soup = BeautifulSoup(page)
    soup.prettify()
    for link in soup.findAll("iframe", src=True):
        r = urllib2.urlopen(link['src'])
        iFrame = link['src']
        print iFrame

def connect_patched(self):
    "Connect to a host on a given (SSL) port."
    sock = socket.create_connection((self.host, self.port),
                                    self.timeout, self.source_address)
    if self._tunnel_host:
        self.sock = sock
        self._tunnel()
    self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file,
                                ssl_version=ssl.PROTOCOL_SSLv2)

httplib.HTTPSConnection.connect = connect_patched
test()
The error I get when running this test:
urllib2.URLError: urlopen error [Errno 6] _ssl.c:504: TLS/SSL connection has been closed
I am hoping a more experienced coder can look at what I have done and tell me whether my current method is the way to go, and if so, how to get past this final error and parse the data table properly.
Working edits with @crmackey
import requests
import os
import zipfile
from pyquery import PyQuery
from requests.packages.urllib3.exceptions import InsecureRequestWarning, InsecurePlatformWarning, SNIMissingWarning
import httplib

httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

# disable ssl warnings (we are not verifying SSL certificates at this time...future enhancement?)
for warning in [SNIMissingWarning, InsecurePlatformWarning, InsecureRequestWarning]:
    requests.packages.urllib3.disable_warnings(warning)

def download_zips(out_path):
    url = 'http://www.floodmaps.fema.gov/NFHL/status.shtml'
    download_prefix = 'https://hazards.fema.gov/femaportal/NFHL'
    pq = PyQuery(requests.get(url, verify=False).content)  # verify param important for SSL
    src = pq.find('iframe').attr('src')
    pq = PyQuery(requests.get(src, verify=False).content)
    table = pq.find('table')
    for a in table.find('a'):
        href = a.attrib.get('href')
        print href
        url = '/'.join([download_prefix, href])
        print url
        r = requests.get(url, stream=True, verify=False)
        out_zip = os.path.join(out_path, href.split('=')[-1])
        with open(out_zip, 'wb') as f:
            for chunk in r.iter_content(1024 * 16):  # grab 16 KB at a time
                if chunk:
                    f.write(chunk)
        print 'downloaded zip: "{}"'.format(href.split('=')[-1])

out_path = r"C:\Users\barr\Desktop\Test"
download_zips(out_path)
All I added was the httplib import and the HTTPConnection changes at the top. That allowed me to connect to the site using your script. Now here is the current problem: I am only getting one zip file in my out_path, and the zip file is empty. I checked the printed source in the debug window, and it shows it is trying to download the TERRITORY OF THE VIRGIN ISLAND zip file from the table, so it looks like it is trying but not downloading anything. After it outputs that one empty zip file, the script finishes with no further error messages. I temporarily removed your lines that unzipped the file because they were returning an error since the folder was empty.
I was able to get the zip files downloaded by using the requests module, and I also opted for PyQuery instead of BeautifulSoup. I think the issue you were facing has to do with SSL certificate validation; the requests module will let you skip checking the certificate if you set the verify parameter to False.
The function below will download all the zip files and unzip them; from there, you can import the shapefiles into your geodatabase:
import requests
import os
import zipfile
from pyquery import PyQuery
from requests.packages.urllib3.exceptions import InsecureRequestWarning, InsecurePlatformWarning, SNIMissingWarning

# disable ssl warnings (we are not verifying SSL certificates at this time...future enhancement?)
for warning in [SNIMissingWarning, InsecurePlatformWarning, InsecureRequestWarning]:
    requests.packages.urllib3.disable_warnings(warning)

def download_zips(out_path):
    url = 'http://www.floodmaps.fema.gov/NFHL/status.shtml'
    download_prefix = 'https://hazards.fema.gov/femaportal/NFHL'
    pq = PyQuery(requests.get(url, verify=False).content)  # verify param important for SSL
    src = pq.find('iframe').attr('src')
    pq = PyQuery(requests.get(src, verify=False).content)
    table = pq.find('table')
    for a in table.find('a'):
        href = a.attrib.get('href')
        url = '/'.join([download_prefix, href])
        r = requests.get(url, stream=True, verify=False)
        out_zip = os.path.join(out_path, href.split('=')[-1])
        with open(out_zip, 'wb') as f:
            for chunk in r.iter_content(1024 * 16):  # grab 16 KB at a time
                if chunk:
                    f.write(chunk)
        print 'downloaded zip: "{}"'.format(href.split('=')[-1])

        # do more stuff like unzip?
        unzipped = out_zip.split('.zip')[0]
        with zipfile.ZipFile(out_zip, 'r') as f:
            f.extractall(unzipped)
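If it helps, here is a sketch of that last step using the same arcpy calls as in your original script; it assumes the file geodatabase (gdbPath in your code) already exists and that out_path points at the folder of extracted shapefiles:

import arcpy

def shapefiles_to_gdb(folder, gdb_path):
    # List the shapefiles that were just extracted and load them into the GDB.
    arcpy.env.workspace = folder
    shps = arcpy.ListFeatureClasses()
    if shps:
        arcpy.FeatureClassToGeodatabase_conversion(shps, gdb_path)

# e.g. after download_zips(out_path):
# shapefiles_to_gdb(out_path, gdbPath)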

How do I submit data to a web form in python?

I'm trying to automate the process of creating an account for something, let's call it X, but I can't figure out what to do.
I saw this code somewhere,
import urllib
import urllib2
import webbrowser

data = urllib.urlencode({'q': 'Python'})
url = 'http://duckduckgo.com/html/'
full_url = url + '?' + data
response = urllib2.urlopen(full_url)
with open("results.html", "w") as f:
    f.write(response.read())
webbrowser.open("results.html")
But I can't figure out how to modify it for my use.
I would highly recommend using Selenium + WebDriver for this, since your question appears to be UI- and browser-based. You can install Selenium via 'pip install selenium' in most cases. Here are a couple of good references to get started.
- http://selenium-python.readthedocs.io/
- https://pypi.python.org/pypi/selenium
Also, if this process needs to drive the browser headlessly, look into including PhantomJS (via GhostDriver), which can be downloaded from the phantomjs.org website.
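A minimal sketch of the Selenium route, against a hypothetical signup page; the URL, field names, and credentials are placeholders you would replace after inspecting the real form in your browser's dev tools:

from selenium import webdriver

SIGNUP_URL = 'https://example.com/signup'  # placeholder URL

driver = webdriver.Firefox()  # or webdriver.Chrome(), or PhantomJS for headless
try:
    driver.get(SIGNUP_URL)
    # Fill in the form fields by their (hypothetical) name attributes.
    driver.find_element_by_name('username').send_keys('my_new_user')
    driver.find_element_by_name('password').send_keys('s3cret')
    # Submit the form.
    driver.find_element_by_name('submit').click()
    print(driver.current_url)  # check where the submission took us
finally:
    driver.quit()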

Problems crawling wordreference

I am trying to crawl WordReference, but I am not succeeding.
The first problem I have encountered is that a big part of the page is loaded via JavaScript, but that shouldn't be much of a problem because I can see what I need in the source code.
So, for example, I want to extract, for a given word, the first two meanings, so in this URL: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.
This is my code:
import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse(urllib2.urlopen(url))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i
The result is that I get an empty list.
I have tried to crawl it with Scrapy too, with no success. I am not sure what is going on; the only way I have been able to crawl it is using curl, but that is sloppy. I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent; see Changing user agent on urllib2.urlopen.
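For completeness, a minimal sketch of the urllib2 route, passing a browser-like User-Agent (the header value here is just an example):

import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
# Any browser-like User-Agent string should do.
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
doc = lh.parse(urllib2.urlopen(request))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)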
Also, just switching to requests would do the trick (it automatically sends a python-requests/version User-Agent by default):
import lxml.html as lh
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get(url)
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)
Prints:
grulla
grúa
plataforma
...
grulla blanca
grulla trompetera
