Beautiful Soup Can't Redact Phone Number with Parentheses

I'm trying to redact phone numbers from an HTML file. I can identify all of the phone numbers easily enough, but I can't figure out why I am unable to replace the ones that have parentheses in them. Sample below:
import re
from bs4 import BeautifulSoup
text = '''<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted#gmail.com</li>
<li>Phones: (555) 555-5555</li>
</ul><b>Category:</b> <ul><li>Title 2 </li><li>Fake Info</li></ul>
City, MO 11111 | (555) 111-1111 | myemail#gmail.com
Some Category / Some Name: 555-222-2222 | Record Number#:
</html>'''
soup = BeautifulSoup(text, 'html.parser')

def find_phone_numbers(text):
    phones = re.findall(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})", text)
    return phones

phones = find_phone_numbers(str(soup))
print(phones)
for i in phones:
    target = soup.find_all(text=re.compile(i, re.I))
    try:
        for v in target:
            v.replace_with(v.replace(i, '(XXX) XXX-XXXX'))
    except TypeError:
        pass
print(soup)
These are my results from running the above:
['(555) 555-5555', '(555) 111-1111', '555-222-2222']
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted#gmail.com</li>
<li>Phones: (555) 555-5555</li>
</ul><b>Category:</b> <ul><li>Title 2 </li><li>Fake Info</li></ul>
City, MO 11111 | (555) 111-1111 | myemail#gmail.com
Some Category / Some Name: (XXX) XXX-XXXX | Record Number#:
</div></body></html>

You can use .find_all(text=True) to obtain all text content from the HTML soup, and then replace it with re.sub (that way, you preserve all tags, including <li>):
for content in soup.find_all(text=True):
    s = re.sub(r'(\(?\d{3}\)?)([\s.-]*)(\d{3})([\s.-]*)(\d{4})', '(XXX) XXX-XXXX', content)
    content.replace_with(s)
print(soup)
Prints:
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted#gmail.com</li>
<li>Phones: (XXX) XXX-XXXX</li>
</ul><b>Category:</b> <ul><li>Title 2 </li><li>Fake Info</li></ul>
City, MO 11111 | (XXX) XXX-XXXX | myemail#gmail.com
Some Category / Some Name: (XXX) XXX-XXXX | Record Number#:
</div></body></html>
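
Neither answer spells out why the original loop skipped the numbers with parentheses. The cause is that re.compile(i) treats the parentheses in '(555) 555-5555' as a regex group, so the compiled pattern searches for the text 555 555-5555, which never appears in the document. A minimal sketch of the difference, using only the standard library; the variable names are illustrative and not from the original post:

```python
import re

phone = "(555) 555-5555"          # one of the strings returned by find_phone_numbers
line = "Phones: (555) 555-5555"   # the text node that should have been redacted

# Compiled as-is, "(555)" is a capture group, so this looks for "555 555-5555".
unescaped = re.compile(phone)
print(unescaped.search(line))     # None

# re.escape turns the parentheses into literals, so the pattern matches the text.
escaped = re.compile(re.escape(phone))
redacted = escaped.sub("(XXX) XXX-XXXX", line)
print(redacted)                   # Phones: (XXX) XXX-XXXX
```

In the original loop this would mean passing re.compile(re.escape(i)) to find_all. The v.replace(i, ...) call needs no change, because str.replace already treats i as a literal string.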

Slight change of approach. Get all li tags, then for each tag, replace the phone numbers with your mask, if a phone number exists. I have used an interim variable for that (temp_text), just to keep the code a bit more readable.
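all_li = soup.find_all('li')
for li in all_li:
    temp_text = re.sub(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})", '(XXX) XXX-XXXX', li.text)
    # Only touch items whose text actually changed; assigning .string
    # replaces the tag's contents and keeps the <li> tag itself.
    if temp_text != li.text:
        li.string = temp_text
print(soup)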
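
Because the substitution itself is plain re.sub, the pattern can be checked against the sample lines without BeautifulSoup. The sketch below reuses the pattern from the question; the helper name redact_phones and the MASK constant are made up for illustration:

```python
import re

# Pattern copied from the question: optional area code (with or without
# parentheses), then 3 digits, a separator, and 4 digits.
PHONE_RE = re.compile(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})")
MASK = "(XXX) XXX-XXXX"

def redact_phones(text):
    """Return text with every phone-number match replaced by MASK."""
    return PHONE_RE.sub(MASK, text)

# Lines taken from the sample HTML in the question.
samples = [
    "Phones: (555) 555-5555",
    "City, MO 11111 | (555) 111-1111 | myemail#gmail.com",
    "Some Category / Some Name: 555-222-2222 | Record Number#:",
]
for sample in samples:
    print(redact_phones(sample))
```

Note that the two lines outside the list items are also redacted here, which a loop restricted to li tags would not reach.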

Related

Pytesseract returns nothing for Urdu and Arabic text

I'm converting an ID card image to text using Pytesseract. So far I've split the image into sections for the name, address, and ID card number, and I parse each section using:
import pytesseract as tess
from PIL import Image
im = Image.open("Image.jpg")
crop_rectangle = (20, 320, 400, 400)
cropped_im = im.crop(crop_rectangle)
text = tess.image_to_string(cropped_im, lang='ara')
print(text)
The result is blank.
In addition, I've also tried:
text = tess.image_to_pdf_or_hocr(cropped_im, lang='ara', extension='hocr')
This additional step returns:
b'<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name=\'ocr-system\' content=\'tesseract v5.0.0-alpha.20191030\' />
<meta name=\'ocr-capabilities\' content=\'ocr_page ocr_carea ocr_par ocr_line ocrx_word
ocrp_wconf\'/>
</head>
<body>
<div class=\'ocr_page\' id=\'page_1\' title=\'image
"C:\\Users\\MOHSIN~1.IFT\\AppData\\Local\\Temp\\tess_za20zk94.PNG"; bbox 0 0 380 80; ppageno 0\'>
<div class=\'ocr_carea\' id=\'block_1_1\' title="bbox 0 0 380 80">
<p class=\'ocr_par\' id=\'par_1_1\' lang=\'ara\' title="bbox 0 0 380 80">
<span class=\'ocr_line\' id=\'line_1_1\' title="bbox 0 0 380 80; baseline 0 0; x_size 108;
x_descenders 27; x_ascenders 27">
<span class=\'ocrx_word\' id=\'word_1_1\' title=\'bbox 0 0 380 80; x_wconf 95\'> </span>
</span>
</p>
</div>
</div>
</body>
</html>'
I need help converting an Urdu/Arabic image into text.
Thank you in advance.

Why is BeautifulSoup getting weird source code characters while downloading a web page?

I'm new to Python and web crawling. I'm doing a few exercises crawling some websites, and it works great with BeautifulSoup. But recently, while crawling a Persian site (https://video.varzesh3.com) with the code below, I received weird characters. I have followed the same procedure on other Persian websites, so I believe the problem is not with the encoding. This is my code:
import requests
from bs4 import BeautifulSoup

url = 'https://video.varzesh3.com'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
print(soup)
And this is a part of the result:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<title>ÙÛØ¯ÛÙ ÙØ±Ø²Ø´ Ø³Ù | Ø®Ø§ÙÙ</title>
<meta content="sport ,varzesh ,football, soccer,livescores, live score, livescore, iran,football3,Daily soccer news , broadcast ,ÙÙØªØ¨Ø§Ù Ø³Ù , ÙØªØ§ÛØ¬ Ø²ÙØ¯Ù , Ø®ÙÛØ¬ ÙØ§Ø±Ø³ , perian gulf , ÙÛÚ¯ Ø¢Ø²Ø§Ø¯Ú¯Ø§Ù , ÙÙØ±ÙØ§ÙØ§Ù Ø¢Ø³ÛØ§ , ÙÙØ±ÙØ§ÙØ§Ù Ø§Ø±ÙÙ¾Ø§ , ÙÛÚ¯ Ø¨Ø±ØªØ± , Ø¬Ø§Ù ØØ°ÙÛ , Ø´Ø¨Ú©Ù Ø³Ù , ÙØ±Ø²Ø´ , ÙÙØªØ¨Ø§Ù Ø¨Ø±ØªØ± , Ø§ÛØ±Ø§Ù , Ø¬Ø§Ù Ø¬ÙØ§ÙÛ , Ø¬Ø§Ù Ø¬ÙØ§ÙÛ 2010,ÙÙØªØ¨Ø§Ù 3 ," name="keywords">
<meta content="Ù¾Ø§ÙÚ¯Ø§Ù ÙÛØ¯ÛÙ ÙØ±Ø²Ø´Û Ø¨Ø±Ø§Ù ÙØ§Ø±Ø³Ù Ø²Ø¨Ø§ÙØ§Ù ÙÙ ÙÛØ¯ÛÙ ØÙØ²Ù ÙØ±Ø²Ø´ (ÙÙØªØ¨Ø§ÙØÙØ§ÙÙØ¨Ø§Ù ØØ¨Ø³ÙØªØ¨Ø§Ù Ù...) Ø±Ø§ Ø§Ø±Ø§Ø¦Ù ÙÛ Ú©ÙØ¯" name="description">
<link href="/Static/css/frontend.min.css?v=9" rel="stylesheet" type="text/css"/>
<link href="https://static2.farakav.com/v3/static/css/fonts.css?version=6" rel="stylesheet" type="text/css"/>
<link href="https://static2.farakav.com/varzesh3/assets/font/varzesh3-icon/varzesh3.min.css" rel="stylesheet" type="text/css"/>
<script src="https://static2.farakav.com/football3_jscripts/jquery-1.8.0.min.js" type="text/javascript"></script>
<script src="/Static/js/jquery.cookie.js" type="text/javascript"></script>
<script src="/Static/js/mustache.js" type="text/javascript"></script>
<script type="text/javascript">
now = new Date();
var head = document.getElementsByTagName('head')[0];
var script = document.createElement('script');
script.type = 'text/javascript';
var script_address = 'https://cdn.yektanet.com/js/varzesh3.com/article.v1.min.js';
script.src = script_address + '?v=' + now.getFullYear().toString() + '0' + now.getMonth() + '0' + now.getDate() + '0' + now.getHours();
head.appendChild(script);
</script>
Why do I get weird characters like "Ù¾Ø§ÙÚ¯Ø§Ù ÙÛØ¯ÛÙ ÙØ±Ø²Ø´Û Ø¨Ø±Ø§Ù ÙØ§Ø±Ø³"?

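You get those weird characters because of the encoding that requests assumed for the response. The page is UTF-8 (see its <meta charset="utf-8"/> tag), but requests decoded it as ISO-8859-1:
>>> source_code.encoding
'ISO-8859-1'
Set the encoding to UTF-8 before reading .text: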
>>> source_code.encoding = 'UTF-8'
>>> plain_text = source_code.text
>>> BeautifulSoup(plain_text, "html.parser")
Output:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<title>ویدیو ورزش سه | خانه</title>
<meta content="sport ,varzesh ,football, soccer,livescores, live score, livescore, iran,football3,Daily soccer news , broadcast ,فوتبال سه , نتایج زنده , خلیج فارس , perian gulf , لیگ آزادگان , قهرمانان آسیا , قهرمانان اروپا , لیگ برتر , جام حذفی , شبکه سه , ورزش , فوتبال برتر , ایران , جام جهانی , جام جهانی 2010,فوتبال 3 ," name="keywords">
<meta content="پايگاه ویدیو ورزشی براي فارسي زبانان كه ویدیو حوزه ورزش (فوتبال،واليبال ،بسكتبال و...) را ارائه می کند" name="description">
<link href="/Static/css/frontend.min.css?v=9" rel="stylesheet" type="text/css"/>
....
...
..
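
The garbling can be reproduced without any network access: UTF-8 bytes decoded as ISO-8859-1 turn each Persian letter into two Latin-1 characters. A small sketch with the standard library only; the sample string is taken from the page title above and the variable names are illustrative:

```python
title = "ویدیو ورزش سه"

# What the server sends: the title encoded as UTF-8 bytes.
raw = title.encode("utf-8")

# What the question's code effectively did: decode those bytes as ISO-8859-1.
garbled = raw.decode("iso-8859-1")
print(len(title), len(garbled))   # the garbled text has one character per byte

# Undoing the wrong decode and applying the right one restores the text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == title)          # True
```

Setting source_code.encoding = 'UTF-8' before reading .text has the same effect. Passing the raw bytes (source_code.content) to BeautifulSoup should also work, since BeautifulSoup then detects the encoding itself.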

Why is text of HTML node empty with HTMLParser?

In the following example I am expecting to get Foo for the <h2> text:
from io import StringIO
from html5lib import HTMLParser
fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>
<span class="section-number">1. </span>
Foo
<a class="headerlink" href="#foo">¶</a>
</h2>
</body>
</html>
''')
etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
h2 = etree.findall('.//h2')[0]
h2.text
Unfortunately I get ''. Why?
Strangely, Foo is in the text:
>>> list(h2.itertext())
['1. ', 'Foo', '¶']
>>> h2.getchildren()
[<Element 'span' at 0x7fa54c6a1bd8>, <Element 'a' at 0x7fa54c6a1c78>]
>>> [node.text for node in h2.getchildren()]
['1. ', '¶']
So where is Foo?

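I think you are one level too shallow in the tree. In an ElementTree-style tree, text that follows a child element is stored in that child's tail, not in the parent's text, so Foo is the tail of the <span>. Try this: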
from io import StringIO
from html5lib import HTMLParser
fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>
<span class="section-number">1. </span>
Foo
<a class="headerlink" href="#foo">¶</a>
</h2>
</body>
</html>
''')
etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
etree.findall('.//h2')[0][0].tail
More generally, to crawl all text and tail, try a loop like this:
for u in etree.findall('.//h2')[0]:
    print(u.text, u.tail)
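
The text/tail split can also be seen with the standard library's xml.etree.ElementTree, which is the kind of element the html5lib call above returns. A minimal sketch on a one-line version of the heading; the helper own_text is hypothetical, not part of html5lib or ElementTree:

```python
import xml.etree.ElementTree as ET

h2 = ET.fromstring(
    '<h2><span class="section-number">1. </span>Foo'
    '<a class="headerlink" href="#foo">¶</a></h2>'
)

span, link = list(h2)
print(h2.text)     # None: nothing precedes the first child
print(span.tail)   # Foo: text after </span> is stored as the span's tail

def own_text(element):
    """Join an element's direct text: its .text plus each child's .tail."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()

print(own_text(h2))  # Foo
```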

Using lxml:
fp2 = '''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>
<span class="section-number">1. </span>
Foo
<a class="headerlink" href="#foo">¶</a>
</h2>
</body>
</html>
'''
import lxml.html
tree = lxml.html.fromstring(fp2)
for item in tree.xpath('//h2'):
    target = item.text_content().strip()
    print(target.split('\n')[1].strip())
Output:
Foo

Python Transcrypt addEventListener

I have written a Python program to be translated to JavaScript with Transcrypt.
I cannot get the addEventListener function to work. Any ideas?
Here is the code as dom7.py:
class TestSystem:
    def __init__ (self):
        self.text = 'Hello, DOM!'
        self.para = 'A new paragraph'
        self.textblock = 'This is an expandable text block.'
        self.button1 = document.getElementById("button1")
        self.button1.addEventListener('mousedown', self.pressed)
    def insert(self):
        document.querySelector('output').innerText = self.text
        # document.querySelector('test').innerText = "Test"+self.button1+":"
    def pressed(self):
        container = document.getElementById('textblock')
        newElm = document.createElement('p')
        newElm.innerText = self.para
        container.appendChild(newElm)
testSystem = TestSystem()
And here follows the corresponding dom7.html for it:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<script src="__javascript__/dom7.js"></script>
<title>Titel</title>
</head>
<body onload=dom7.testSystem.insert()>
<button id="button1">Click me</button><br>
<main>
<h1>DOM examples</h1>
<p>Testing DOM</p>
<p>
<output></output>
</p>
<p>
<test>Test String:</test>
</p>
<div id="textblock">
<p>This is an expandable text block.</p>
</div>
</main>
</body>
</html>

The problem is that your TestSystem constructor is called before the DOM tree is ready. There are three ways to deal with this, the last of which is the best.
The first way is to include your script after you populated your body:
class TestSystem:
    def __init__ (self):
        self.text = 'Hello, DOM!'
        self.para = 'A new paragraph'
        self.textblock = 'This is an expandable text block.'
        self.button1 = document.getElementById("button1")
        self.button1.addEventListener('mousedown', self.pressed)
    def insert(self):
        document.querySelector('output').innerText = self.text
        # document.querySelector('test').innerText = "Test"+self.button1+":"
    def pressed(self):
        container = document.getElementById('textblock')
        newElm = document.createElement('p')
        newElm.innerText = self.para
        container.appendChild(newElm)
testSystem = TestSystem()
and:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Titel</title>
</head>
<body onload=dom7.testSystem.insert()>
<button id="button1">Click me</button><br>
<main>
<h1>DOM examples</h1>
<p>
Testing DOM
</p>
<p>
<output></output>
</p>
<p>
<test>Test String:</test>
</p>
<div id="textblock">
<p>This is an expandable text block.</p>
</div>
<script src="__javascript__/dom7.js"></script>
</main>
</body>
</html>
Still, your insert function may be called too early, so it may not work.
The second way is to include the script at the beginning and call an initialization function to connect event handlers to the DOM:
class TestSystem:
    def __init__ (self):
        self.text = 'Hello, DOM!'
        self.para = 'A new paragraph'
        self.textblock = 'This is an expandable text block.'
        self.button1 = document.getElementById("button1")
        self.button1.addEventListener('mousedown', self.pressed)
    def insert(self):
        document.querySelector('output').innerText = self.text
        # document.querySelector('test').innerText = "Test"+self.button1+":"
    def pressed(self):
        container = document.getElementById('textblock')
        newElm = document.createElement('p')
        newElm.innerText = self.para
        container.appendChild(newElm)
def init ():
    testSystem = TestSystem()
and:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<script src="__javascript__/dom7.js"></script>
<title>Titel</title>
</head>
<body onload=dom7.testSystem.insert()>
<button id="button1">Click me</button><br>
<main>
<h1>DOM examples</h1>
<p>
Testing DOM
</p>
<p>
<output></output>
</p>
<p>
<test>Test String:</test>
</p>
<div id="textblock">
<p>This is an expandable text block.</p>
</div>
<script>dom7.init ();</script>
</main>
</body>
</html>
Still, there is a possibility that some browsers call the initialization function before the page is loaded, although this is rare. In addition, the insert method is again called too early.
The third and best way, which solves both problems, is to run your initialization after the page load event and to call insert after you create your testSystem, e.g. in the initialization function:
class TestSystem:
    def __init__ (self):
        self.text = 'Hello, DOM!'
        self.para = 'A new paragraph'
        self.textblock = 'This is an expandable text block.'
        self.button1 = document.getElementById("button1")
        self.button1.addEventListener('mousedown', self.pressed)
    def insert(self):
        document.querySelector('output').innerHTML = self.text
        # document.querySelector('test').innerText = "Test"+self.button1+":"
    def pressed(self):
        container = document.getElementById('textblock')
        newElm = document.createElement('p')
        newElm.innerText = self.para
        container.appendChild(newElm)
def init ():
    testSystem = TestSystem()
    testSystem.insert ()
and:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<script src="__javascript__/dom7.js"></script>
<title>Titel</title>
</head>
<body onload="dom7.init ();">
<button id="button1">Click me</button><br>
<main>
<h1>DOM examples</h1>
<p>
Testing DOM
</p>
<p>
<output></output>
</p>
<p>
<test>Test String:</test>
</p>
<div id="textblock">
<p>This is an expandable text block.</p>
</div>
</main>
</body>
</html>

I looked at your mondrian example in the tutorial section and saw that there is also a very simple way to attach event listeners after the document has loaded. You can register a handler for the DOMContentLoaded event in a script in the head of the HTML document:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<script src="__javascript__/addEventListener_example1.js"></script>
<script>document.addEventListener("DOMContentLoaded", addEventListener_example1.init)</script>
<title>Titel</title>
</head>
<body>
<button id="button1">Click me</button><br>
<main>
<h1>DOM examples</h1>
<p>
Testing DOM
</p>
<p>
<output></output>
</p>
<p>
<test>Test String:</test>
</p>
<div id="textblock">
<p>This is an expandable text block.</p>
</div>
</main>
</body>
</html>
and the code for addEventListener_example1.py would be:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
def init():
    insert()
def insert():
    document.querySelector('output').innerHTML = 'Hello, DOM!'
    button1 = document.getElementById("button1")
    button1.addEventListener('mousedown', pressed)
def pressed():
    para = 'A new paragraph'
    container = document.getElementById('textblock')
    newElm = document.createElement('p')
    newElm.innerText = para
    container.appendChild(newElm)

CloudFlare Access Denied while running Download and Parse Script

I am dealing with a legal issue, and built a script so I didn't have to search a website by hand.
Script:
import sys, urllib
servno = 2000
servernomax = 2676
alldat = ""
while True:
    newdat = ""
    url = "http://coc-servers.com/servers/"+str(servno)
    wp = str(urllib.urlopen(url).read())
    print wp
    ind1 = wp.find('"IP: "')
    if ind1 != -1:
        ind1 += 7
        ind2 = wp.find('http',ind1)
        ind3 = wp.find('"',ind2)
        IPurl = wp[ind2:ind3]
        newdat += IPurl
    ind4 = wp.find("<th>Webiste</th>")
    if ind4 != -1:
        ind4 +=22
        ind5 = wp.find('http',ind4)
        ind6 = wp.find('"',ind5)
        Website = wp[ind5:ind6]
        newdat += ", "
        newdat += Website
    alldat += newdat
    servno +=1
    #print ind1, ind4
    if servno > 2676: break
print alldat
sys.exit()
Bug free, however some values need tweaking.
The output?
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Access denied | coc-servers.com used CloudFlare to restrict access</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
<style type="text/css">body{margin:0;padding:0}</style>
<!--[if lte IE 9]><script type="text/javascript" src="/cdn-cgi/scripts/jquery.min.js"></script><![endif]-->
<!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/zepto.min.js"></script><!--<![endif]-->
<script type="text/javascript" src="/cdn-cgi/scripts/cf.common.js"></script>
</head>
<body>
<div id="cf-wrapper">
<div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
<div id="cf-error-details" class="cf-error-details-wrapper">
<div class="cf-wrapper cf-header cf-error-overview">
<h1>
<span class="cf-error-type" data-translate="error">Error</span>
<span class="cf-error-code">1010</span>
<small class="heading-ray-id">Ray ID: 24730841e07509a6 • 2015-11-18 10:36:04 UTC</small>
</h1>
<h2 class="cf-subheadline" data-translate="error_desc">Access denied</h2>
</div><!-- /.header -->
<section></section><!-- spacer -->
<div class="cf-section cf-wrapper">
<div class="cf-columns two">
<div class="cf-column">
<h2 data-translate="what_happened">What happened?</h2>
<p>The owner of this website (coc-servers.com) has banned your access based on your browser's signature (24730841e07509a6-ua48).</p>
</div>
</div>
</div><!-- /.section -->
<div class="cf-error-footer cf-wrapper">
<p>
<span class="cf-footer-item">CloudFlare Ray ID: <strong>24730841e07509a6</strong></span>
<span class="cf-footer-separator">•</span>
<span class="cf-footer-item"><span data-translate="your_ip">Your IP</span>: 64.18.227.167</span>
<span class="cf-footer-separator">•</span>
<span class="cf-footer-item"><span data-translate="performance_security_by">Performance & security by</span> <a data-orig-proto="https" data-orig-ref="www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link" target="_blank">CloudFlare</a></span>
</p>
</div><!-- /.error-footer -->
</div><!-- /#cf-error-details -->
</div><!-- /#cf-wrapper -->
<script type="text/javascript">
window._cf_translation = {};
</script>
</body>
</html>
Alright, so it wor- wait.. What? Access Denied? I have been banned? Based on my browser?
How can I get around this? I'm aware CloudFlare was built to prevent DDoSing, but, this is not a DDoS at all.
I would try implementing a delay, however, the first through last response is the same message.
Would implementing multiple browser agents and a delay fix it, or am I done for?

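Following the docs at http://wolfprojects.altervista.org/articles/change-urllib-user-agent/ , I was able to run the script without errors and without CloudFlare banning me. The fix is to send a browser-like User-Agent string instead of urllib's default one, which is done by overriding the version attribute of FancyURLopener.
The new script is: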
import sys
from urllib import FancyURLopener
servno = 2224 #2000
servernomax = 2676
alldat = ""
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
mopen = MyOpener()
while True:
    newdat = ""
    url = "http://coc-servers.com/servers/"+str(servno)
    wp = str(mopen.open(url).read())#str(urlopen(url).read())
    #print wp
    ind1 = wp.find('IP: ')
    if ind1 != -1:
        ind1 += 7
        ind2 = wp.find('http',ind1)
        ind3 = wp.find('"',ind2)
        IPurl = wp[ind2:ind3]
        newdat += IPurl
    ind4 = wp.find("<th>Website</th>")
    if ind4 != -1:
        ind4 +=22
        ind5 = wp.find('http',ind4)
        ind6 = wp.find('"',ind5)
        Website = wp[ind5:ind6]
        newdat += ", "
        newdat += Website
    newdat += ";;; "
    alldat += newdat
    servno +=1
    #print ind1, ind4
    if servno > 2676: break
print alldat
sys.exit()
Who knew FancyURLopener would be so useful? :)
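
The script above is Python 2, and FancyURLopener is deprecated in Python 3. A rough Python 3 sketch of the same idea, setting the User-Agent header on a urllib.request.Request, might look like the following. The helper names build_request and fetch_page and the User-Agent string are illustrative assumptions, and whether the site accepts this header today is not something the original post can confirm:

```python
from urllib.request import Request, urlopen

BASE_URL = "http://coc-servers.com/servers/"   # base URL from the script above
# Illustrative browser-like value; any realistic User-Agent string could be used.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

def build_request(server_number, user_agent=USER_AGENT):
    """Build a request for one server page with a custom User-Agent header."""
    return Request(BASE_URL + str(server_number), headers={"User-Agent": user_agent})

def fetch_page(server_number, timeout=10):
    """Download one server page and return it as text (performs network I/O)."""
    with urlopen(build_request(server_number), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

# Building the request does not touch the network, so it is safe to inspect.
request = build_request(2224)
print(request.full_url)
print(request.get_header("User-agent"))
```

The rest of the loop (the find calls and the slicing) would work on the string returned by fetch_page in the same way as on wp above.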
