Django 404 Email spam - python

I recently launched a web application based on Django and have been very pleased with the results. I also turned on a feature in Django that emails MANAGERS about 404s by adding the middleware 'django.middleware.common.BrokenLinkEmailsMiddleware'. However, ever since I did that, I've been getting LOTS of spam requests hitting 404s. I'm not sure if they are bots or what, but this is the information I'm getting from Django:
Referrer: http://34.212.239.19/index.php
Requested URL: /index.php
User agent: Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)
IP address: 172.31.23.16
Why am I getting requests to URLs that don't exist on my site, and is there a way to filter out such requests so I don't get emails for them? These URLs have never existed on my site (the site launched very recently). I'm getting roughly 50-100 emails a day from spam requests.

There is no fully automatic way to filter out the spam, since a bot's request for a non-existent URL looks the same as a genuine broken link, but you can filter out the usual suspects using IGNORABLE_404_URLS:
List of compiled regular expression objects describing URLs that should be ignored when reporting HTTP 404 errors via email (see Error reporting). Regular expressions are matched against request's full paths (including query string, if any). Use this if your site does not provide a commonly requested file such as favicon.ico or robots.txt.
For example:
import re

IGNORABLE_404_URLS = [
    re.compile(r'\.(php|cgi)$'),
    re.compile(r'^/phpmyadmin/'),
]
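For context, a minimal sketch of how the pieces above might fit together in settings.py (the email address and the extra patterns are assumptions; on older Django versions the middleware list is MIDDLEWARE_CLASSES instead of MIDDLEWARE):

import re

# Who receives the broken-link emails.
MANAGERS = [('Admin', 'admin@example.com')]

# The middleware from the question that triggers the 404 emails.
MIDDLEWARE = [
    # ... other middleware ...
    'django.middleware.common.BrokenLinkEmailsMiddleware',
]

# Common bot probes to silence; extend the list as new patterns show up.
IGNORABLE_404_URLS = [
    re.compile(r'\.(php|cgi)$'),
    re.compile(r'^/phpmyadmin/'),
    re.compile(r'^/wp-(admin|login)'),
]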

Related

Django failing to load 400.html

I've defined custom templates for errors 400 and 404 for my Django project. When I access the production version of my site, the 404 template is correctly loaded for missing pages. However, if I send a bad request to my Apache/Django server (e.g. http://mysite.example.com/%), the 400 template is not loaded; instead, the regular Apache error page is rendered:
Bad Request
Your browser sent a request that this server could not understand.
Apache/2.4.18 (Ubuntu) Server at mysite.example.com Port 80
Is Apache relaying this request to Django at all, or do I need to define handler400 in my Django project for this to work (though I didn't have to do that for 404.html)?
The crucial point here is that Apache is acting as a proxy for your uWSGI server. It forwards all valid requests to uWSGI; a request for a non-existent URL is still a valid request as far as Apache is concerned, so it is forwarded to the Django URL router, which finds that no URL pattern matches and raises a 404. That error is handled internally by Django and results in the Django 404 page being shown.
Some requests, most notably 400 responses that Django REST Framework produces when its serializers fail to validate the incoming JSON, are also generated inside Django; those will likewise result in the Django 400 page being shown.
However, if the request itself is malformed, it is never forwarded to uWSGI and Django never sees it; Apache handles it internally, which is why the Apache 400 page is shown.
The simplest solution would be to replace the Apache error pages with the corresponding Django ones (if those are templates, render them and save the resulting HTML).
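A rough sketch of that last suggestion, run from python manage.py shell on the server (the output directory is an assumption, and templates that rely on request context may need extra variables passed in):

from django.template.loader import render_to_string

# Render the custom error templates to static HTML that Apache can serve
# directly (the output path is an assumption; adjust to your setup).
for code in (400, 404):
    html = render_to_string('%d.html' % code)
    with open('/var/www/errors/%d.html' % code, 'w') as f:
        f.write(html)

Apache can then be pointed at those files with its ErrorDocument directive, e.g. ErrorDocument 400 /errors/400.html, so even requests Django never sees get a matching page.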

AppEngine Python urlfetch() fails with 416 error, same query succeeds in a browser

I'm dusting off an app that worked a few months ago. I've made no changes. Here's the code in question:
result = urlfetch.fetch(
    url=url,
    deadline=TWENTY_SECONDS)
if result.status_code != 200:  # pragma: no cover
    logging.error('urlfetch failed.')
    logging.error('result.status_code = %s' % result.status_code)
    logging.error('url =')
    logging.error(url)
Here's the output:
WARNING 2015-04-20 01:13:46,473 urlfetch_stub.py:118] No ssl package found. urlfetch will not be able to validate SSL certificates.
ERROR 2015-04-20 01:13:46,932 adminhandlers.py:84] urlfetch failed. url =
ERROR 2015-04-20 01:13:46,933 adminhandlers.py:85] http://www.stubhub.com/listingCatalog/select/?q=%2Bevent_date%3A%5BNOW%20TO%20NOW%2B1DAY%5D%0D%0A%2BancestorGeoDescriptions:%22New%20York%20Metro%22%0D%0A%2BstubhubDocumentType%3Aevent&version=2.2&start=0&rows=1&wt=json&fl=name_primary+event_date_time_local+venue_name+act_primary+ancestorGenreDescriptions+description
When I use a different url, e.g., "http://www.google.com/", the fetch succeeds.
When I paste the url string from the output into Chrome I get this response, which is the one I'm looking for:
{"responseHeader":{"status":0,"QTime":19,"params":{"fl":"name_primary event_date_time_local venue_name act_primary ancestorGenreDescriptions description","start":"0","q":"+event_date:[NOW TO NOW+1DAY]\r\n+ancestorGeoDescriptions:\"New York Metro\"\r\n+stubhubDocumentType:event +allowedViewingDomain:stubhub.com","wt":"json","version":"2.2","rows":"1"}},"response":{"numFound":26,"start":0,"docs":[{"act_primary":"Waka Flocka Flame","description":"Waka Flocka Flame Tickets (18+ Event)","event_date_time_local":"2015-04-20T20:00:00Z","name_primary":"Webster Hall","venue_name":"Webster Hall","ancestorGenreDescriptions":["All tickets","Concert tickets","Artists T - Z","Waka Flocka Flame Tickets"]}]}}
I hope I'm missing something simple. Any suggestions?
Update May 30, 2015
Anzel's suggestion of Apr 23 was correct. I need to add a user agent header. The one supplied by the AppEngine dev server is
AppEngine-Google; (+http://code.google.com/appengine)
The one supplied by hosted AppEngine is
AppEngine-Google; (+http://code.google.com/appengine; appid: s~MY_APP_ID)
The one supplied by requests.get() in pure Python (no AppEngine) on MacOS is
python-requests/2.2.1 CPython/2.7.6 Darwin/14.3.0
When I switch in the Chrome user agent header all is well in pure Python. Stubhub must have changed this since I last tried it. Curious that they would require an interactive user agent for a service that emits JSON, but I'm happy they offer the service at all.
When I add that header in AppEngine, though, AppEngine prepends it to its own user-agent header. Stubhub then turns down the request.
So I've made some progress, but have not yet solved my problem.
FYI:
In AppEngine I supply the user agent like this:
result = urlfetch.fetch(
    url=url,
    headers={'user-agent': USER_AGENT_STRING},
)
This is a useful site for determining the user agent string your code or browser is sending:
http://myhttp.info/
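For comparison, a minimal pure-Python sketch (no AppEngine) of the same idea using the requests library; the User-Agent value below is just an example browser string, not necessarily the one actually used:

import requests

# url is the same StubHub query URL shown in the log output above.
headers = {
    # Any interactive-browser User-Agent appears to satisfy StubHub here.
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0 Safari/537.36',
}
response = requests.get(url, headers=headers, timeout=20)
print response.status_code
print response.json()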
I don't have privileges yet to post comments, so here goes.
Look at the way you are entering the URL into the var 'url'. Is it already encoded, as the logged output suggests? I would make sure the URL is a regular, non-encoded one and test that; perhaps the library is re-encoding it again, causing problems. If you could share more of the surrounding code, that may help with the diagnosis.
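As a sketch of that suggestion, build the query string from plain, un-encoded values and let the library encode it exactly once; the parameter values below are copied from the logged URL, though urlencode's output may differ cosmetically (e.g. + instead of %20 for spaces):

import urllib

params = {
    'q': ('+event_date:[NOW TO NOW+1DAY]\r\n'
          '+ancestorGeoDescriptions:"New York Metro"\r\n'
          '+stubhubDocumentType:event'),
    'version': '2.2',
    'start': '0',
    'rows': '1',
    'wt': 'json',
    'fl': ('name_primary event_date_time_local venue_name act_primary '
           'ancestorGenreDescriptions description'),
}
url = 'http://www.stubhub.com/listingCatalog/select/?' + urllib.urlencode(params)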

https site with Django in text browser throws CSRF verification failed

I have a Django site that works well on a server using HTTPS; I can use it with no problem in all kinds of browsers.
The thing is that every time I try to use a text browser, I get a
Forbidden (403)
CSRF verification failed. Request aborted.
You are seeing this message because this HTTPS site requires a 'Referer header' to be sent by your Web browser, but none was sent.
This header is required for security reasons, to ensure that your browser is not being hijacked by third parties.
If you have configured your browser to disable 'Referer' headers, please re-enable them, at least for this site, or for HTTPS
connections, or for 'same-origin' requests.
Help
Reason given for failure:
Referer checking failed - no Referer.
I have tried links, lynx, even w3m and eww on emacs, to no avail.
When I use an HTTP site (like when I'm using manage.py runserver) I can use the site in text browsers with no problem, but my production server needs HTTPS, and that's when I get this error.
[ EDIT: just for testing purposes, I deployed an HTTP server for my Django site on the production server. It works well in text browsers... ]
[ EDIT: given the message the server throws, why are Referer headers not being sent? ]
Lynx is likely configured to not send the Referer header. Check /etc/lynx.cfg for "REFERER".
There are entries like NO_REFERER_HEADER. Make sure that's set to false. If that's not it, check around in that config for any other disabled referer headers.
Also related, the CSRF and Referer header debate: https://code.djangoproject.com/ticket/16870
Are you setting SECURE_PROXY_SSL_HEADER, SESSION_COOKIE_SECURE and CSRF_COOKIE_SECURE in your settings?
https://docs.djangoproject.com/en/1.7/topics/security/#ssl-https
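If not, a minimal sketch of those settings (assuming the proxy sets X-Forwarded-Proto; adjust the header name to whatever your front end actually sends):

# settings.py
# Trust the proxy's statement that the original request was HTTPS.
SECURE_PROXY_SSL_HEADER = ('HTTP_X_FORWARDED_PROTO', 'https')

# Only send the session and CSRF cookies over HTTPS connections.
SESSION_COOKIE_SECURE = True
CSRF_COOKIE_SECURE = True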

How to programmatically retrieve editing history pages from MusicBrainz using python?

I'm trying to programmatically retrieve editing history pages from the MusicBrainz website. (musicbrainzngs is a library for the MB web service, and editing history is not accessible from the web service). For this, I need to login to the MB website using my username and password.
I've tried using the mechanize module: using the login page's second form (the first one is the search form), I submit my username and password. From the response, it seems that I successfully log in to the site; however, a further request to an editing history page raises an exception:
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
I understand the exception and the reason for it. I take full responsibility for not abusing the site (after all, any usage will be tagged with my username), I just want to avoid manually opening a page, saving the HTML and running a script on the saved HTML. Can I overcome the 403 error?
The better solution is to respect the robots.txt file and simply download the edit data itself rather than screen-scraping MusicBrainz. You can download the complete edit history here:
ftp://ftp.musicbrainz.org/pub/musicbrainz/data/fullexport
Look for the file mbdump-edit.tar.bz2.
And, as the leader of the MusicBrainz team, I would like to ask you to respect robots.txt and download the edit data. That's one of the reasons why we make the edit data downloadable.
Thanks!
If you want to circumvent the site's robots.txt, you can achieve this by telling your mechanize.Browser to ignore the robots.txt file.
br = mechanize.Browser()
br.set_handle_robots(False)
Additionally, you might want to alter your browser's user agent so you don't look like a robot:
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
Please be aware that when doing this, you're actually tricking the website into thinking you're a valid client.

log into website (specifically netflix) with python

I am trying to log into Netflix with Python. It would work perfectly, but I can't get it to detect whether or not the login failed. The code looks like this:
# this is not purely my code! Thanks to Ori for the code
import urllib

username = raw_input('Enter your email: ')
password = raw_input('Enter your password: ')
params = urllib.urlencode(
    {'email': username,
     'password': password})
f = urllib.urlopen("https://signup.netflix.com/Login", params)
if "The login information you entered does not match an account in our records. Remember, your email address is not case-sensitive, but passwords are." in f.read():
    success = False
    print "Either your username or password was incorrect."
else:
    success = True
    print "You are now logged into netflix as", username
raw_input('Press enter to exit the program')
As always, many thanks!!
First, I'll just share some verbiage I noticed on the Netflix site under Limitations on Use:
Any unauthorized use of the Netflix service or its contents will terminate the limited license granted by us and will result in the cancellation of your membership.
In short, I'm not sure what your script does after this, but some activities could jeopardize your relationship with Netflix. I did not read the whole ToS, but you should.
That said, there are plenty of legitimate reasons to scrape html information, and I do it all the time. So my first bet with this specific problem is you're using the wrong detection string... Just send a bogus email/password and print the response... Perhaps you made an assumption about what it looks like when you log in with a browser, but the browser is sending info that gets further into the process.
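A quick way to check that assumption, sketched on top of the question's own code (the credentials below are deliberately bogus placeholders): send a known-bad login and save the raw response so you can see exactly what Netflix returns for a failed attempt.

import urllib

params = urllib.urlencode({'email': 'bogus@example.com', 'password': 'wrong'})
f = urllib.urlopen("https://signup.netflix.com/Login", params)
# Save the body so you can search it for the real failure message.
with open('bogus_login_response.html', 'w') as out:
    out.write(f.read())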
I wish I could offer specifics on what to do next, but I would rather not risk my relationship with 'flix to give a better answer to the question... so I'll just share a few observations I gleaned from scraping oodles of other websites that made it kind of hard to use web robots...
First, log in to your account with Firefox, and be sure to have the Live HTTP Headers add-on enabled and in capture mode... what you will see when you log in live is invaluable to your scripting efforts... for instance, this was from a session while I logged in...
POST /Login HTTP/1.1
Host: signup.netflix.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.16) Gecko/20110319 Firefox/3.6.16
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: https://signup.netflix.com/Login?country=1&rdirfdc=true
--->Insert lots of private stuff here
Content-Type: application/x-www-form-urlencoded
Content-Length: 168
authURL=sOmELoNgTeXtStRiNg&nextpage=&SubmitButton=true&country=1&email=EmAiLAdDrEsS%40sOmEMaIlProvider.com&password=UnEnCoDeDpAsSwOrD
Pay particular attention to the stuff below "Content-Length" field and all the parameters that come after it.
Now log back out, and pull up the login site page again... chances are, you will see some of those fields hidden as state information in <input type="hidden"> tags... some web apps keep state by feeding you fields and then they use javascript to resubmit that same information in your login POST. I usually use lxml to parse the pages I receive... if you try it, keep in mind that lxml prefers utf-8, so I include code that automagically converts when it sees other encodings...
import chardet
from urllib2 import urlopen

response = urlopen(req, data)
# info is from the HTTP headers... like server version
info = response.info().dict
# page is the HTML response
page = response.read()
encoding = chardet.detect(page)['encoding']
if encoding != 'utf-8':
    page = page.decode(encoding, 'replace').encode('utf-8')
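To make the hidden-field idea concrete, here is a rough sketch (not the original answerer's code) of pulling the hidden input tags out of the login page with lxml and folding them into the POST, along the lines of the captured form body above; field names such as authURL and SubmitButton come from that capture and should be treated as assumptions:

import urllib
import urllib2
import lxml.html

login_url = 'https://signup.netflix.com/Login'

# Fetch the login page and collect every <input type="hidden"> field.
doc = lxml.html.fromstring(urllib2.urlopen(login_url).read())
hidden = {}
for el in doc.xpath('//input[@type="hidden"]'):
    if el.get('name'):
        hidden[el.get('name')] = el.get('value') or ''

# Merge the hidden state (authURL, country, ...) with the credentials from
# the question's raw_input calls, mimicking the captured POST body.
form = dict(hidden)
form.update({'email': username, 'password': password, 'SubmitButton': 'true'})

req = urllib2.Request(login_url, urllib.urlencode(form), {
    # Example header values; mimic whatever your own capture shows.
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:1.9.2.16)',
    'Referer': 'https://signup.netflix.com/Login?country=1&rdirfdc=true',
})
response = urllib2.urlopen(req)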
BTW, Michael Foord has a very good reference on urllib2 and many of the assorted issues.
So, in summary:
Using your existing script, dump the results from a known bogus login to be sure you're parsing for the right info... I'm pretty sure you made a bad assumption above
It also looks like you aren't submitting enough parameters in the POST. Experience tells me you need to set authURL in addition to email and password... if possible, I try to mimic what the browser sends...
Occasionally, it matters whether you have set your user-agent string and referring webpage. I always set these when I scrape so I don't waste cycles debugging.
When all else fails, look at info stored in cookies they send
Sometimes websites base64 encode form submission data. I don't know whether Netflix does
Some websites are very protective of their intellectual property, and programmatically reading/archiving the information is considered a theft of their IP. Again, read the ToS... I don't know how Netflix views what you want to do.
I am providing this for informational purposes and under no circumstances endorse, or condone the violation of Netflix terms of service... nor can I confirm whether your proposed activity would... I'm just saying it might :-). Talk to a lawyer that specializes in e-discovery if you need an official ruling. Feet first. Don't eat yellow snow... etc...
