100,000 HTTP Response Code Checks - python

I've got a list of ~100,000 links that I'd like to check the HTTP Response Code for. What might be the best method to use for doing this check programmatically?
I'm considering using the below Python code:
import requests

try:
    for x in range(0, 100000):
        r = requests.head(''.join(["http://stackoverflow.com/", str(x)]))
        # They'll actually be read from a file, and aren't sequential
        print r.status_code
except requests.ConnectionError:
    print "failed to connect"
... but I'm not aware of the potential side effects of checking such a large number of URLs in a single run. Thoughts?

The only side effect I can think of is time, which you can mitigate by making the requests in parallel (using, e.g., http://gevent.org/ or https://docs.python.org/2/library/thread.html).
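For instance, a rough sketch of the thread-pool version (Python 3, using requests with concurrent.futures; the file name urls.txt and the worker count are only placeholders):

import requests
from concurrent.futures import ThreadPoolExecutor

def check(url):
    # HEAD keeps the responses small; the timeout stops one slow host
    # from tying up a worker indefinitely
    try:
        return url, requests.head(url, timeout=10).status_code
    except requests.RequestException as exc:
        return url, str(exc)

# read the URLs from your file (one per line)
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with ThreadPoolExecutor(max_workers=20) as pool:
    for url, status in pool.map(check, urls):
        print(url, status)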

Related

Checking URL Status without Throwing Error

I'm looking to check whether 500+ strings in a given dataframe are URLs. I've seen that this can be done using the requests package, but I've found that if I provide an invalid URL, instead of receiving a 404 error code, my program crashes.
Because I'm looking to apply this function to a dataframe where many of the strings are not active URLs, the current function would not work for what I'm trying to accomplish.
I'm wondering if there is a way to adapt the code below to return 'No' (or anything else) in the case that the URL isn't real. For example, providing the URL 'http://www.example.commmm' results in an error:
import requests

response = requests.get('http://www.example.com')
if response.status_code == 200:
    print('Yes')
else:
    print('No')
thanks in advance!
I would add a try/except to prevent your code from breaking:
try:
    print(x)
except:
    print("An exception occurred")

Forex-python "Currency Rates Source Not Ready"

I want to use the Forex-python module to convert amounts in various currencies to a specific currency ("DKK") at a specific date (the last day of the previous month, relative to a date in the dataframe).
This is the structure of my code:
from datetime import datetime, timedelta
from forex_python.converter import CurrencyRates

c = CurrencyRates()

df = pd.DataFrame(data={'Date': ['2017-4-15', '2017-6-12', '2017-2-25'], 'Amount': [5, 10, 15], 'Currency': ['USD', 'SEK', 'EUR']})

def convert_rates(amount, currency, PstngDate):
    PstngDate = datetime.strptime(PstngDate, '%Y-%m-%d')
    if currency != 'DKK':
        # subtracting the day-of-month steps back to the last day of the previous month
        return c.convert(base_cur=currency, dest_cur='DKK', amount=amount,
                         date_obj=PstngDate - timedelta(PstngDate.day))
    else:
        return amount
and then the new column with the converted amount:
df['Amount, DKK'] = np.vectorize(convert_rates)(
    amount=df['Amount'],
    currency=df['Currency'],
    PstngDate=df['Date']
)
I get the RatesNotAvailableError "Currency Rates Source Not Ready"
Any idea what can cause this? It has previously worked with small amounts of data, but I have many rows in my real df...
I inserted a small print statement into convert.py (part of forex-python) to debug this.
print(response.status_code)
Currently I receive:
502
Read these threads about the HTTP 502 error:
In HTTP 502, what is meant by an invalid response?
https://www.lifewire.com/502-bad-gateway-error-explained-2622939
These errors are completely independent of your particular setup, meaning that you could see one in any browser, on any operating system, and on any device.
502 indicates that currently there is a problem with the infrastructure this API uses to provide us with the required data. As I am in need of the data myself I will continue to monitor this issue and keep my post on this site updated.
There is already an issue on GitHub about this:
https://github.com/MicroPyramid/forex-python/issues/100
From the source: https://github.com/MicroPyramid/forex-python/blob/80290a2b9150515e15139e1a069f74d220c6b67e/forex_python/converter.py#L73
Your error means the library received a non-200 response code to your request. This could mean the site is down, or that it has blocked you momentarily because you're hammering it with requests.
Try replacing the call to c.convert with something like:
from time import sleep

def try_convert(amount, currency, PstngDate):
    success = False
    while not success:
        try:
            res = c.convert(base_cur=currency, dest_cur='DKK', amount=amount,
                            date_obj=PstngDate - timedelta(PstngDate.day))
            success = True
        except Exception:
            # the source wasn't ready: wait a while, then retry
            sleep(10)
    return res
Or even better, use a library like backoff to do the retrying for you:
https://pypi.python.org/pypi/backoff/1.3.1
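For instance, a sketch of the same idea with backoff (assuming you want to retry on forex-python's RatesNotAvailableError and give up after a handful of attempts):

import backoff
from forex_python.converter import CurrencyRates, RatesNotAvailableError

c = CurrencyRates()

# retry with exponential backoff, giving up after 8 attempts
@backoff.on_exception(backoff.expo, RatesNotAvailableError, max_tries=8)
def convert_with_retry(amount, currency, date_obj):
    if currency == 'DKK':
        return amount
    return c.convert(base_cur=currency, dest_cur='DKK',
                     amount=amount, date_obj=date_obj)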

How can I detect the method to request data from this site?

UPDATE: I've put together the following script to fetch the URL for the XML without the time-code-like suffix, as recommended in the answer below, and to report the downlink powers, which clearly fluctuate on the website. Instead, I'm getting three-hour-old, unvarying data.
So it looks like I need to properly construct that suffix (time code? authorization? secret password?) in order to do this successfully. Like I say in the comment below, "I don't want to do anything that's not allowed and welcome - NASA has enough challenges already trying to talk to a forty year old spacecraft 20 billion kilometers away!"
def dictify(r, root=True):
    """from: https://stackoverflow.com/a/30923963/3904031"""
    if root:
        return {r.tag: dictify(r, False)}
    d = copy(r.attrib)
    if r.text:
        d["_text"] = r.text
    for x in r.findall("./*"):
        if x.tag not in d:
            d[x.tag] = []
        d[x.tag].append(dictify(x, False))
    return d

import xml.etree.ElementTree as ET
from copy import copy
import urllib2

url = 'https://eyes.nasa.gov/dsn/data/dsn.xml'
contents = urllib2.urlopen(url).read()

root = ET.fromstring(contents)
DSNdict = dictify(root)

dishes = DSNdict['dsn']['dish']

dp_dict = dict()
for dish in dishes:
    powers = [float(sig['power']) for sig in dish['downSignal'] if sig['power']]
    dp_dict[dish['name']] = powers

print dp_dict['DSS26']
I'd like to keep track of which spacecraft that the NASA Deep Space Network (DSN) is communicating with, say once per minute.
I learned how to do something similar for Flight Radar 24 from the answer to my previous question, which also still represents my current skill level in getting data from web sites.
For FR24, the explanations in this blog were a great place to start. I opened the page with the Developer Tools in the Chrome browser, and I can see that data for items such as dishes, spacecraft, and associated numerical values are requested as XML with URLs such as
https://eyes.nasa.gov/dsn/data/dsn.xml?r=293849023
so it looks like I need to construct the integer (time code? authorization? secret password?) after the r= once a minute.
My Question: Using python, how could I best find out what that integer represents, and how to generate it in order to correctly request data once per minute?
Above: screenshot montage from NASA's DSN Now page, https://eyes.nasa.gov/dsn/dsn.html; see also this question.
Using a random number (or a timestamp...) in a GET parameter tricks the browser into actually making the request instead of serving it from the browser cache.
This is a kind of "hack" web developers use to make sure the request really happens.
Since you aren't using a web browser, I'm pretty sure you could totally ignore this parameter, and still get the refreshed data.
--- Edit ---
Actually r seems to be required, and has to be updated.
#!/bin/bash
wget https://eyes.nasa.gov/dsn/data/dsn.xml?r=$(date +%s) -O a.xml -nv
while true; do
    sleep 1
    wget https://eyes.nasa.gov/dsn/data/dsn.xml?r=$(date +%s) -O b.xml -nv
    diff a.xml b.xml
    cp b.xml a.xml -f
done
You don't need to emulate a browser. Simply set r to anything and increment it. (Or use a timestamp)
Regarding your updated question, why avoid sending the r query string parameter when it is very easy to generate it? Also, with the requests module, it's easy to send the parameter with the request too:
import time
import requests
import xml.etree.ElementTree as ET
url = 'https://eyes.nasa.gov/dsn/data/dsn.xml'
r = int(time.time() / 5)
response = requests.get(url, params={'r': r})
root = ET.fromstring(response.content)
# etc....
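And if you want to poll once per minute, a rough sketch of the loop could look like this (the parsing of root is whatever you already do in your script above):

import time
import requests
import xml.etree.ElementTree as ET

url = 'https://eyes.nasa.gov/dsn/data/dsn.xml'

while True:
    # a fresh timestamp-based value for r defeats any caching along the way
    response = requests.get(url, params={'r': int(time.time())})
    root = ET.fromstring(response.content)
    # ... pull the downlink powers out of root, as in the script above ...
    time.sleep(60)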

Python: Make HTTP Errors (500) Not Stop Script

This is the basic part of my code that I need help with. Note that I learned Python like last week. I don't understand try/except, but I know that's what I need for this, so if anyone could help, that would be great.
url = 'http://google.com/{0}/{1}'.format(variable, variable1)
site = urllib.request.urlopen(url)
That's not the real website, but you get the idea. I'm running a loop 5 times per item, over around 20 different items. So it goes something like:
google.com/spiders/ (runs 5 times with different types of spiders)
google.com/dogs/ (runs 5 times with different types of dogs), etc.
Now the 2nd variable is the same for about 90% of the items I'm looping over, but 1 or 2 of them have some of the "types" and not others, so I get an HTTP 500 error because that page doesn't exist. How do I make it basically skip that? I know 500 probably isn't the right error code for a missing page, but the point is that the pages for those items don't exist. So how do I set this up so that it just skips an item if it gets any error?
You can use a try/except block in your loop, like:
try:
    url = 'http://google.com/{0}/{1}'.format(variable, variable1)
    site = urllib.request.urlopen(url)
except Exception as ex:
    print("ERROR - " + str(ex))
You can also catch only specific exceptions; the code above catches any exception at all (including a bug in your own code, not just a network error).
See here for more: https://wiki.python.org/moin/HandlingExceptions
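For example, a sketch that skips only HTTP errors (like your 500) inside the loop and lets anything else propagate; the items and types lists are placeholders for your real data:

import urllib.request
import urllib.error

items = ['spiders', 'dogs']        # placeholder for your ~20 items
types = ['tarantula', 'beagle']    # placeholder for the 5 types per item

for variable in items:
    for variable1 in types:
        url = 'http://google.com/{0}/{1}'.format(variable, variable1)
        try:
            site = urllib.request.urlopen(url)
        except urllib.error.HTTPError as ex:
            # this page doesn't exist (500, 404, ...): skip it and keep looping
            print("skipping", url, "-", ex.code)
            continue
        # ... process site here ...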

use many processes to filter a million records

I have a Python script that works well for a few numbers:
def ncpr(searchnm):
    import urllib2
    from bs4 import BeautifulSoup

    url = "http://www.domain.com/saveSearchSub.misc?phoneno=" + searchnm
    soup = BeautifulSoup(urllib2.urlopen(url))
    header = soup.find('td', class_='GridHeader')
    result = []
    for row in header.parent.find_next_siblings('tr'):
        cells = row.find_all('td')
        try:
            result.append(cells[2].get_text(strip=True))
        except IndexError:
            continue
    if not result:
        # the number is not in the database, so report it
        return searchnm

with open("Output.txt3", "w") as text_file:
    for i in range(9819838100, 9819838200):
        myl = str(ncpr(str(i)))
        if myl != 'None':
            text_file.write(myl + '\n')
It checks a range of 100 numbers and returns each number that is not present in the database. It takes a few seconds to process 100 records.
I need to process a million numbers starting from different ranges.
For e.g.
9819800000 9819900000
9819200000 9819300000
9829100000 9829200000
9819100000 9819200000
7819800000 7819900000
8819800000 8819900000
9119100000 9119200000
9119500000 9119600000
9119700000 9119800000
9113100000 9113200000
This dictionary will be generated from the list supplied:
mylist = [98198, 98192, 98291, 98191, 78198, 88198, 91191, 91195, 91197, 91131]

mydict = {}
for mynumber in mylist:
    start_range = int(str(mynumber) + '00000')
    end_range = int(str(mynumber + 1) + '00000')
    mydict[start_range] = end_range
I need to use threads in such a way that I can check 1 million records as quickly as possible.
The problem with your code is not so much about how to parallelize it, but about the fact that you query a single number per request. This means processing a million numbers will generate a million requests, using a million separate HTTP sessions on a million new TCP connections, to www.nccptrai.gov.in. I don't think the webmaster will enjoy this.
Instead, you should find a way to get a database dump of some kind. If that's impossible, restructure your code to reuse a single connection to issue multiple requests. That's discussed here: How to Speed Up Python's urllib2 when doing multiple requests
By issuing all your requests on a single connection you avoid a ton of overhead, and will experience greater throughput as well, hopefully culminating in being able to send a single packet per request and receive a single packet per response. If you live outside India, far from the server, you may benefit quite a bit from HTTP Pipelining as well, where you issue multiple requests without waiting for earlier responses. There's a sort of hack that demonstrates that here: http://code.activestate.com/recipes/576673-python-http-pipelining/ - but beware this may again get you in more trouble with the site's operator.
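As a rough illustration of the connection-reuse idea (note this swaps your urllib2 calls for the requests library, whose Session keeps the TCP connection alive between requests; the URL and the parsing are taken from your question):

import requests
from bs4 import BeautifulSoup

def ncpr(session, searchnm):
    url = "http://www.domain.com/saveSearchSub.misc?phoneno=" + searchnm
    soup = BeautifulSoup(session.get(url).content, "html.parser")
    header = soup.find('td', class_='GridHeader')
    result = []
    for row in header.parent.find_next_siblings('tr'):
        cells = row.find_all('td')
        if len(cells) > 2:
            result.append(cells[2].get_text(strip=True))
    # None means the number was found; otherwise report the missing number
    return None if result else searchnm

# one Session reused for every request, so the TCP connection stays open
with requests.Session() as session, open("Output.txt3", "w") as text_file:
    for i in range(9819838100, 9819838200):
        myl = ncpr(session, str(i))
        if myl is not None:
            text_file.write(myl + '\n')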
