import urllib.parse
from urllib.request import urlopen


def read_text():
    # Read the file and pass its contents to the profanity checker.
    text = open("/home/pizzapablo666/Desktop/Test")
    contents_of_file = text.read()
    print(contents_of_file)
    text.close()
    check_profanity(contents_of_file)


def check_profanity(text_to_check):
    # URL-encode the text so spaces and punctuation are valid in the query string.
    text_to_check = urllib.parse.quote_plus(text_to_check)
    connection = urlopen(
        "http://www.wdylike.appspot.com/?q=" + text_to_check)
    output = connection.read()
    print(output)
    connection.close()


read_text()
This is the updated version.
HTTP 400 error (Bad Request): what is the cause, and how can I fix this error?
I think it is because you are not encoding your string before appending it to your URL.
For example, in Python 3 you should do the following to text_to_check before appending it to your URL:
text_to_check = urllib.parse.quote_plus(text_to_check)
In Python 2 it would be something like this (urllib was split into smaller modules in Python 3):
text_to_check = urllib.quote_plus(text_to_check)
This means that a string containing whitespace is appended to your URL as something like "Am+I+cursing%3F" instead of "Am I cursing?".
Full check_profanity() example:
def check_profanity(text_to_check):
    text_to_check = urllib.parse.quote_plus(text_to_check)
    connection = urlopen(
        "http://www.wdylike.appspot.com/?q=" + text_to_check)
    output = connection.read()
    print(output)
    connection.close()
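To see what the encoding step does without touching the network, you can build the URL and inspect it. This is only a minimal sketch: build_query_url is a hypothetical helper name that does not appear in the original code, and the endpoint is simply the one from the question.

```python
from urllib.parse import quote_plus

BASE_URL = "http://www.wdylike.appspot.com/?q="  # endpoint taken from the question


def build_query_url(text):
    # Hypothetical helper: percent-encode the text so it is safe in a query string.
    return BASE_URL + quote_plus(text)


# Spaces become "+" and "?" becomes "%3F".
print(build_query_url("Am I cursing?"))

# A file read with text.read() usually contains newlines, which are not valid
# in a raw URL; quote_plus escapes them as well.
print(quote_plus("line one\nline two"))
```

Printing the URL before calling urlopen is a cheap way to check whether an HTTP 400 comes from unescaped characters.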
I am using the requests library (Python 3.9) to get a file name from a URL.[1] For some reason the file name is incorrectly encoded.
I should get "Ogłoszenie_0320.pdf" instead of "OgÅ\x82oszenie_0320.pdf".
My code looks something like this:
import os
import re
from urllib.parse import urlparse

import requests


def getFilenameFromRequest(url : str, headers):
    # Parses from header information
    contentDisposition = headers.get('content-disposition')
    if contentDisposition:
        filename = re.findall('filename=(.+)', contentDisposition)
        print("oooooooooo: " + contentDisposition + " : " + str(filename))
        if len(filename) != 0:
            return filename[0]
    # Parses from url
    parsedUrl = urlparse(url)
    return os.path.basename(parsedUrl.path)


def getFilenameFromUrl(url : str):
    request = requests.head(url)
    headers = request.headers
    return getFilenameFromRequest(url, headers)


getFilenameFromUrl('https://przedszkolekw.bip.gov.pl'+
                   '/fobjects/download/880287/ogloszenie-uzp-nr-613234-pdf.html')
Any idea how to fix it?
I know that for a standard request I can set the encoding directly:
request.encoding = 'utf-8'
But what am I supposed to do in this case?
[1]
https://przedszkolekw.bip.gov.pl/fobjects/download/880287/ogloszenie-uzp-nr-613234-pdf.html
Only characters from the ASCII-based Latin-1 character set should be used in header values [rfc]. Here the UTF-8 bytes of the file name have been read back as if they were Latin-1 characters.
>>> s = "Ogłoszenie_0320.pdf"
>>> s.encode("utf8").decode("unicode-escape")
'OgÅ\x82oszenie_0320.pdf'
To reverse the process you can do
>>> sx = 'OgÅ\x82oszenie_0320.pdf'
>>> sx.encode("latin-1").decode("utf8")
'Ogłoszenie_0320.pdf'
(updated after conversation in comments)
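The round trip can be wrapped in a small helper that is applied to the value taken from the Content-Disposition header. This is a sketch under the assumption that the server sent UTF-8 bytes which were decoded as Latin-1; fix_header_text is a hypothetical name, and the fallback simply returns the value unchanged when that assumption does not hold.

```python
def fix_header_text(value):
    # Hypothetical helper: undo a Latin-1 decoding of bytes that were really UTF-8.
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The value was not mis-decoded UTF-8 (or is already correct); leave it alone.
        return value


garbled = 'OgÅ\x82oszenie_0320.pdf'
print(fix_header_text(garbled))

# Plain ASCII names and names that are already correct pass through unchanged.
print(fix_header_text("report.pdf"))
print(fix_header_text("Ogłoszenie_0320.pdf"))
```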
I have the following code:
#!/usr/bin/python
import time, uuid, hmac, hashlib, base64, json
import urllib3
import certifi
import datetime
import requests
import re
from datetime import datetime

http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED', # Force certificate check.
    ca_certs=certifi.where(),  # Path to the Certifi bundle.
)

#Get the status response from pritunl api
BASE_URL = 'https://www.vpn.trimble.cloud:443'
API_TOKEN = 'gvwrfQZQPryTbX3l03AQMwTyaE0aFywE'
API_SECRET = 'B0vZp5dDyOrshW1pmFFjAnIUyeGtFy9y'
LOG_PATH = '/var/log/developer_vpn/'

def auth_request(method, path, headers=None, data=None):
    auth_timestamp = str(int(time.time()))
    auth_nonce = uuid.uuid4().hex
    auth_string = '&'.join([API_TOKEN, auth_timestamp, auth_nonce,
        method.upper(), path] + ([data] if data else []))
    auth_signature = base64.b64encode(hmac.new(
        API_SECRET, auth_string, hashlib.sha256).digest())
    auth_headers = {
        'Auth-Token': API_TOKEN,
        'Auth-Timestamp': auth_timestamp,
        'Auth-Nonce': auth_nonce,
        'Auth-Signature': auth_signature,
    }
    if headers:
        auth_headers.update(headers)
    return http.request(method, BASE_URL + path, headers=auth_headers, body=data)

response1 = auth_request('GET',
    '/server',
)
if response1.status == 200:
    pritunlServResponse = (json.loads(response1.data))
    #print pritunlServResponse
    #print response1.data
    Name = [y['name'] for y in pritunlServResponse]
    Server_id = [x['id'] for x in pritunlServResponse]
    for srv_name, srv_id in zip(Name, Server_id):
        response2 = auth_request('GET',
            '/server/' + srv_id + '/output',
        )
        pritunlServResponse2 = (json.loads(response2.data))
        py_pritunlServResponse2 = pritunlServResponse2['output']
        print("value of srv_id: ", srv_id, "\n")
        print("value of srv_name: ", srv_name, "\n")
        logfile = open(LOG_PATH + srv_name +'_vpn_out.log', 'w')
        for log in py_pritunlServResponse2:
            if re.search(r'(?!52\.39\.62\.8)', log):
                logfile.write("%s\n" % log)
        logfile.close()
else:
    raise SystemExit
This code visits a website using authentication (the address has been redacted), grabs some text formatted in JSON, and parses two values from the output: "srv_name" and "srv_id". This code then uses the "srv_id" to construct additional HTTP requests to get log files from the server. It then grabs the log files - one for each "srv_id" and names them with the values obtained from "srv_name" and saves them on the local system.
I want to do some additional grep-style processing before the files are written to the local system. Specifically I'd like to exclude any text exactly containing "52.39.62.8" from being written. When I run the code above, it looks like the regex is not being processed as I still see "52.39.62.8" in my output files.
If the IP address is always flanked by specific characters, e.g. (52.39.62.8):, you can use the in operator for an exact substring test:
if '(52.39.62.8):' not in log:
    logfile.write(log + '\n')
re.search(r'(?!52\.39\.62\.8)', log)
You're matching any empty string that is not followed by the IP address. Every string will match, because the pattern matches at the end of any string.
Reverse your logic and write the line to the log only if re.search for the IP address comes back as None.
if re.search(r'(?<!\d)52\.39\.62\.8(?!\d)', log) is None:
    logfile.write("%s\n" % log)
Note that this also includes its own negative look-behind and look-ahead assertions to ensure no digits precede or follow the IP address.
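The difference between the two patterns can be checked on a few sample lines. This is a small sketch; the log lines are made up for illustration and keep_line is a hypothetical helper name.

```python
import re

# The pattern from the question: a bare negative lookahead.
ALWAYS_MATCHES = re.compile(r'(?!52\.39\.62\.8)')

# The corrected pattern: search for the address itself, guarded against
# neighbouring digits so that e.g. 152.39.62.8 or 52.39.62.80 do not count.
IP_PATTERN = re.compile(r'(?<!\d)52\.39\.62\.8(?!\d)')


def keep_line(line):
    # Hypothetical helper: True when the line does NOT mention the address.
    return IP_PATTERN.search(line) is None


sample_lines = [  # made-up log lines
    "client 52.39.62.8 connected",
    "client 152.39.62.8 connected",
    "client 52.39.62.80 connected",
    "server started",
]

# The lookahead-only pattern finds an (empty) match in every line, even in one
# that consists of nothing but the address.
print(all(ALWAYS_MATCHES.search(line) for line in sample_lines))
print(ALWAYS_MATCHES.search("52.39.62.8") is not None)

kept = [line for line in sample_lines if keep_line(line)]
print(kept)
```

Only the first sample line is dropped; the two lines with neighbouring digits are kept because they are different addresses.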
This is the code for a web crawler.
I'm a beginner learning Python, so I don't know how to solve this.
Something seems to be wrong with search().
# -*- coding:utf-8 -*-
import urllib,urllib2,re

class BDTB:
    def __init__(self,baseUrl,seeLZ):
        self.baseUrl = baseUrl
        self.seeLZ = '?see_lz' + str(seeLZ)

    def getPage(self,pageNum):
        try:
            url = self.baseUrl + self.seeLZ + '&pn=' + str(pageNum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            #print response.read().decode('utf-8')
            return response
        except urllib2.URLError,e:
            if hasattr(e,'reason'):
                # The message reads: "Failed to connect to Baidu Tieba, reason:"
                print u'连接百度贴吧失败,错误原因',e.reason
                return None

    def getTitle(self):
        page = self.getPage(1)
        pattern = re.compile('<h3 class.*?px">(.*?)</h3>',re.S)
        result = re.search(pattern,page)
        if result:
            print result.group(1)
            return result.group(1).strip()
        else:
            return None

baseURL = 'http://tieba.baidu.com/p/4095047339'
bdtb = BDTB(baseURL,1)
bdtb.getTitle()
This will raise TypeError: expected string or buffer, because you are passing the object returned from urllib2.urlopen(request) to re.search(), which requires a str.
If you change the return value from:
return response # returns the response object
to one that returns the text contained in the response:
return response.read() # returns the text contained in the response
your script works, and after executing it prints:
广告兼职及二手物品交易集中贴
(This is the thread title; roughly, "Collected thread for ads, part-time jobs and second-hand trading".)
Additionally, since you're working with Python 2.x, you might want to change your class definition from class BDTB: to class BDTB(object): in order to use new-style classes.
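The same mistake and fix can be reproduced without a network connection. The sketch below is written for Python 3 (the question uses Python 2), uses io.BytesIO as a stand-in for the object returned by urlopen, and the HTML snippet is made up for illustration.

```python
import io
import re

# Stand-in for the response object; a real response also exposes .read().
fake_response = io.BytesIO(
    '<h3 class="core_title_txt" style="width: 396px">Example title</h3>'.encode("utf-8")
)

pattern = re.compile(r'<h3 class.*?px">(.*?)</h3>', re.S)

# Passing the response object itself fails: re.search needs text, not a file-like object.
try:
    re.search(pattern, fake_response)
    search_failed = False
except TypeError as exc:
    search_failed = True
    print("TypeError:", exc)

# Reading (and, in Python 3, decoding) the body first gives re.search a string.
html = fake_response.read().decode("utf-8")
result = pattern.search(html)
title = result.group(1).strip() if result else None
print(title)
```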
I'm trying to strip the characters b,'().
The issue I'm having is that I get TypeError: 'str' does not support the buffer interface.
Here are the relevant parts of the code:
import urllib3

def command_uptime():
    http = urllib3.PoolManager()
    r = http.request('GET', 'https://nightdev.com/hosted/uptime.php?channel=TrippedNW')
    rawData = r.data
    liveTime = bytes(rawData.strip("b,\'()", rawData))
    message = "Tripped has been live for: ", liveTime
    send_message(CHAN, message)
What you have is binary data. It's not a string. You need to decode it first.
Also, you don't need to pass rawData to itself in the strip method.
import urllib3

def command_uptime():
    http = urllib3.PoolManager()
    r = http.request('GET', 'https://nightdev.com/hosted/uptime.php?channel=TrippedNW')
    strData = r.data.decode('utf-8')
    liveTime = strData.strip("b,\'()")
    message = "Tripped has been live for: %s" % liveTime
    print(message)

command_uptime()
Also be aware that your message variable is a tuple, not a string. I don't know if send_message expects this. I formatted it into a single string.
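The stray b,'() characters most likely come from printing the bytes object directly rather than from the server. A short offline sketch, with a made-up bytes value standing in for r.data, shows the difference between the representation of bytes and the decoded text:

```python
raw = b"1 hour, 18 minutes"  # made-up stand-in for r.data

# repr(raw) (and str(raw) in Python 3) keeps the b'...' wrapper.
as_repr = repr(raw)
print(as_repr)

# Decoding turns the bytes into ordinary text with no wrapper to strip.
as_text = raw.decode("utf-8")
print(as_text)

# Mixing bytes and str in strip() is what triggers the TypeError.
try:
    raw.strip("b,'()")
    strip_error = None
except TypeError as exc:
    strip_error = exc
    print("TypeError:", exc)
```

Once the data is decoded there is usually nothing left to strip, apart from whitespace if the server adds any.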
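Just remove the second argument. Note that on Python 3 r.data is bytes, so the characters to strip must also be given as bytes, and the result should be decoded before printing.
import urllib3

def command_uptime():
    http = urllib3.PoolManager()
    r = http.request('GET', 'https://nightdev.com/hosted/uptime.php?channel=TrippedNW')
    rawData = r.data
    # strip() on bytes needs a bytes argument; decode to get a str for printing
    liveTime = rawData.strip(b"b,'()").decode('utf-8')
    print("Tripped has been live for: %s" % liveTime)

command_uptime()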
Output:
Tripped has been live for: 1 hour, 18 minutes
EDIT (SOLVED): When I read the values in from my file, a newline character (\n) is included at the end of each one. This splits my request string at that point.
I think it's to do with how I saved the values to the file in the first place. Many thanks.
I have the following code:
results = 'http://www.myurl.com/'+str(mystring)
print str(results)
request = urllib2.Request(results)
request.add_header('User-Agent','Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)')
opener = urllib2.build_opener()
text = opener.open(request).read()
Which is in a loop.
After the loop has run a few times, str(mystring) changes to give a different set of results.
I can loop the script as many times as I like while keeping the value of str(mystring) constant, but every time I change the value of str(mystring) I get an error saying "no host given" when the code tries to build the opener.
opener = urllib2.build_opener()
Can anyone help, please?
TIA,
Paul.
EDIT:
More code here.....
import sys
import string
import httplib
import urllib2
import re
import random
import time

def StripTags(text):
    finished = 0
    while not finished:
        finished = 1
        start = text.find("<")
        if start >= 0:
            stop = text[start:].find(">")
            if stop >= 0:
                text = text[:start] + text[start+stop+1:]
                finished = 0
    return text

mystring="test"
d={}
with open("myfile","r") as f:
    while True:
        page_counter=0
        print str(mystring)
        try:
            while page_counter <20:
                results = 'http://www.myurl.com/'+str(mystring)
                print str(results)
                request = urllib2.Request(results)
                request.add_header('User-Agent','Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)')
                opener = urllib2.build_opener()
                text = opener.open(request).read()
                finds = (re.findall('([\w\.\-]+'+mystring+')',StripTags(text)))
                for find in finds:
                    d[find]=1
                uniq_emails=d.keys()
                page_counter = page_counter +1
                print "found this " +str(finds)
                random.seed()
                n = random.random()
                i = n * 5
                print "Pausing script for " + str(i) + " Seconds" + ""
                time.sleep(i)
            mystring=next(f)
        except IOError:
            print "No result found!"+""
I found the answer. It's as follows:
The values for mystring were read in from a file.
In the script I wrote to write the file, I opened it with "w" instead of "wb".
Each line in the file ended with a newline character "\n".
When mystring was added to the request string, the newline ended up in the middle of the request string.[1]
This would never have been apparent from my code, because I changed it before posting here in an effort to hide the real URL I am using to get my results.[2]
My actual URL looks more like this:
Myurl.com/mystring/otherstuff/page_counter/morestuff.htm
The \n read from the file split my URL and gave urllib problems.
[1] I use Windows. It adds lots of unseen things to text files. If I'd opened the file for writing with "wb" instead of "w", the contents would have been written without the unseen \n.
[2] Always post your full code, kids. The good people of Stack Overflow can't help you unless they can see what you are doing.
Many thanks all, I hope this helps someone out at some point.
Paul.
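Whatever way the file was written, a common way to guard against this on the reading side is to strip each line before it goes into the URL. The sketch below is written for Python 3, uses io.StringIO in place of the real file, and build_urls is a hypothetical helper name; the base URL is the placeholder from the question.

```python
import io

BASE_URL = "http://www.myurl.com/"  # placeholder URL from the question


def build_urls(lines):
    # Hypothetical helper: turn each non-empty line into a URL without stray whitespace.
    urls = []
    for line in lines:
        term = line.strip()  # removes "\n", "\r\n" and surrounding spaces
        if term:             # skip blank lines
            urls.append(BASE_URL + term)
    return urls


# Iterating over a file object yields lines that still end with their newline.
fake_file = io.StringIO("first\nsecond\r\n\nthird")
urls = build_urls(fake_file)
for url in urls:
    print(repr(url))
```

Printing with repr() makes any leftover invisible character show up as an escape sequence, which is often the quickest way to spot this kind of problem.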
In the while loop, you're setting results to something which is not a url:
results = 'myurl+str(mystring)'
It should probably be results = myurl+str(mystring)
By the way, it appears there's no need for all the casting to string (str()) that you do:
(expanded on request)
print str(foo): in such a case, str() is never necessary. Python will always print foo's string representation.
results = 'http://www.myurl.com/'+str(mystring): this is also unnecessary; mystring is already a string, so 'http://www.myurl.com/' + mystring would suffice.
print "Pausing script for " + str(i) + " Seconds": here you would get an error without str(), since you can't concatenate a string and a number. However, print "foo", 1, "bar" does work, as do print "foo %i bar" % 1 and print "foo {0} bar".format(1).
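The alternatives can be compared side by side. This sketch uses Python 3 syntax (print as a function), whereas the question's code is Python 2; the value of i is made up.

```python
i = 2.5  # made-up pause length in seconds

# Concatenation needs an explicit str() because str + float is not allowed.
concatenated = "Pausing script for " + str(i) + " Seconds"

# The formatting operators convert the value for you.
percent_style = "Pausing script for %s Seconds" % i
format_style = "Pausing script for {0} Seconds".format(i)

print(concatenated)
print(percent_style)
print(format_style)

# Without str(), concatenation raises a TypeError.
try:
    "Pausing script for " + i
    concat_error = None
except TypeError as exc:
    concat_error = exc
    print("TypeError:", exc)
```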