Python scraping TOR, script "To Russia, with love"

I am trying to scrape with BS4 via Tor, using the "To Russia With Love" tutorial from the Stem project.
I've rewritten the code a bit, drawing on this answer among others, and it now looks like this:
import io

import pycurl
import stem.process
from stem.util import term

SOCKS_PORT = 7000

def query(url):
    # Fetch the URL through the local Tor SOCKS proxy and return the response body.
    output = io.BytesIO()
    query = pycurl.Curl()
    query.setopt(pycurl.URL, url)
    query.setopt(pycurl.PROXY, 'localhost')
    query.setopt(pycurl.PROXYPORT, SOCKS_PORT)
    query.setopt(pycurl.PROXYTYPE, pycurl.PROXYTYPE_SOCKS5_HOSTNAME)
    query.setopt(pycurl.WRITEFUNCTION, output.write)
    try:
        query.perform()
        return output.getvalue()
    except pycurl.error as exc:
        return "Unable to reach %s (%s)" % (url, exc)

def print_bootstrap_lines(line):
    # Only echo Tor's bootstrap progress messages.
    if "Bootstrapped " in line:
        print(term.format(line, term.Color.BLUE))

print(term.format("Starting Tor:\n", term.Attr.BOLD))
tor_process = stem.process.launch_tor_with_config(
    tor_cmd = '/Applications/TorBrowser.app/Contents/MacOS/Tor/tor.real',
    config = {
        'SocksPort': str(SOCKS_PORT),
        'ExitNodes': '{ru}',
        'GeoIPFile': r'/Applications/TorBrowser.app/Contents/Resources/TorBrowser/Tor/geoip',
        'GeoIPv6File' : r'/Applications/TorBrowser.app/Contents/Resources/TorBrowser/Tor/geoip6'
    },
    init_msg_handler = print_bootstrap_lines,
)
print(term.format("\nChecking our endpoint:\n", term.Attr.BOLD))
print(term.format(query("https://www.atagar.com/echo.php"), term.Color.BLUE))
I am able to establish a Tor circuit, but at "Checking our endpoint" I receive the following error:
Checking our endpoint:
Traceback (most recent call last):
  File "<ipython-input-804-68f8df2c050b>", line 40, in <module>
    print(term.format(query('https://www.atagar.com/echo.php'), term.Color.BLUE))
  File "/Applications/anaconda/lib/python3.6/site-packages/stem/util/term.py", line 139, in format
    if RESET in msg:
TypeError: a bytes-like object is required, not 'str'
What should I change to see the endpoint?
I've temporarily worked around it by replacing the last line of the above code with:
test=requests.get('https://www.atagar.com/echo.php')
soup = BeautifulSoup(test.content, 'html.parser')
print(soup)
but I'd like to know how to get the 'original' line working.

You appear to be running this under Python 3, while that code was written for Python 2. In Python 2, str and bytes are the same type; in Python 3, str is what Python 2 called unicode, and bytes is a separate type that cannot be mixed with it. To write a byte string literal in Python 3, put a b directly before the string, e.g.:
b"this is a byte string"
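In this particular traceback the mismatch runs the other way: `if RESET in msg` fails because msg (the value returned by query) is bytes while RESET is a str. A minimal sketch of that situation, assuming the response body is UTF-8 text; the RESET value and the sample body below are stand-ins for illustration, not taken from Stem or from the echo service:

```python
import io

# Stand-in for the buffer that pycurl fills via WRITEFUNCTION.
output = io.BytesIO()
output.write(b"203.0.113.7")  # made-up response body

raw = output.getvalue()       # bytes under Python 3
RESET = "\x1b[0m"             # hypothetical str constant, for illustration only

try:
    RESET in raw              # searching for a str inside bytes
    mixing_works = True
except TypeError:
    mixing_works = False

text = raw.decode("utf-8")    # bytes -> str, assuming UTF-8 content
print(mixing_works)
print(RESET in text)
```

Applied to the question's code, this would probably mean returning output.getvalue().decode('utf-8') from query, or decoding the result before handing it to term.format.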

Related

How do I get the Swagger-generated Python client to work?

I have generated the python client and server from https://editor.swagger.io/ - and the server runs correctly with no editing, but I can't seem to get the client to communicate with it - or with anything.
I suspect I'm doing something really silly but the examples I've found on the Internet either don't work or appear to be expecting that I understand how to craft the object. Here's my code (I've also tried sending nothing, a string, etc):
import time
import swagger_client
import json
from swagger_client.rest import ApiException
from pprint import pprint

# Configure OAuth2 access token for authorization: petstore_auth
swagger_client.configuration.access_token = 'special-key'
# create an instance of the API class
api_instance = swagger_client.PetApi()
d = '{"id": 0,"category": {"id": 0,"name": "string"},"name": "doggie","photoUrls": ["string"], "tags": [ { "id": 0, "name": "string" } ], "status": "available"}'
python_d = json.loads(d)
print( json.dumps(python_d, sort_keys=True, indent=4) )
body = swagger_client.Pet(python_d) # Pet | Pet object that needs to be added to the store
try:
    # Add a new pet to the store
    api_instance.add_pet(body)
except ApiException as e:
    print("Exception when calling PetApi->add_pet: %s\n" % e)
I'm using python 3.6.4 and when the above runs I get:
Traceback (most recent call last):
  File "petstore.py", line 14, in <module>
    body = swagger_client.Pet(python_d) # Pet | Pet object that needs to be added to the store
  File "/Users/bombcar/mef/petstore/python-client/swagger_client/models/pet.py", line 69, in __init__
    self.name = name
  File "/Users/bombcar/mef/petstore/python-client/swagger_client/models/pet.py", line 137, in name
    raise ValueError("Invalid value for `name`, must not be `None`") # noqa: E501
ValueError: Invalid value for `name`, must not be `None`
I feel I'm making an incredibly basic mistake, but I've literally copied the JSON from https://editor.swagger.io/ - but since I can't find an actually working example I don't know what I'm missing.

The Python client generator produces object-oriented wrappers for the API. You cannot post a dict or a JSON string directly; you need to create a Pet object using the generated wrapper:
api_instance = swagger_client.PetApi()
pet = swagger_client.Pet(name="doggie", status="available",
                         photo_urls=["http://img.example.com/doggie.png"],
                         category=swagger_client.Category(id=42))
response = api_instance.add_pet(pet)
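To see why passing the parsed dict fails, here is a self-contained sketch with a hypothetical stand-in class, PetSketch, that mimics a model taking each field as a keyword argument and rejecting a missing name. It is not the generated swagger_client.Pet code, and the parameter order is an assumption; the point is only that a dict passed positionally lands in the first parameter and leaves name as None.

```python
import json


class PetSketch:
    """Hypothetical stand-in for a generated model class (not swagger_client.Pet)."""

    def __init__(self, id=None, category=None, name=None,
                 photo_urls=None, tags=None, status=None):
        if name is None:
            raise ValueError("Invalid value for `name`, must not be `None`")
        self.id = id
        self.category = category
        self.name = name
        self.photo_urls = photo_urls
        self.tags = tags
        self.status = status


data = json.loads('{"id": 0, "name": "doggie", "photoUrls": ["string"], "status": "available"}')

try:
    PetSketch(data)  # the whole dict becomes `id`; `name` stays None
    error = None
except ValueError as exc:
    error = str(exc)

# Map JSON keys to keyword arguments explicitly (photoUrls -> photo_urls).
pet = PetSketch(id=data["id"], name=data["name"],
                photo_urls=data["photoUrls"], status=data["status"])
print(error)
print(pet.name, pet.photo_urls)
```

Note that the JSON key photoUrls does not match the Python keyword photo_urls, so unpacking the dict with ** would not work either without renaming keys first.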

I ran into a similar issue recently and fixed it by upgrading my local Python version. I generated the python-flask-server zip file from https://editor.swagger.io/ and set up the environment locally with Python 3.6.10. I got the same error when parsing input with "Model_Name.from_dict()":
    raise ValueError("Invalid value for `name`, must not be `None`") # noqa: E501
ValueError: Invalid value for `name`, must not be `None`
After I upgraded to Python 3.7.x, the issue was resolved. I know your problem is not version-related, but I'm leaving this here in case someone with a similar issue comes looking for help.

Python 3.5 / Pastebin "Bad API request, invalid api_option"

I'm working on a twitch irc bot and one of the components I wanted to have available was the ability for the bot to save quotes to a pastebin paste on close, and then retrieve the same quotes on start up.
I've started with the saving part, and have hit a road block where I can't seem to get a valid post, and I can't figure out a method.
#!/usr/bin/env python3
import urllib.parse
import urllib.request

# --------------------------------------------- Pastebin Requisites --------------------------------------------------
pastebin_key = 'my pastebin key' # developer api key, required. GET: http://pastebin.com/api
pastebin_password = 'password' # password for pastebin_username
pastebin_postexp = 'N' # N = never expire
pastebin_private = 0 # 0 = Public 1 = unlisted 2 = Private
pastebin_url = 'http://pastebin.com/api/api_post.php'
pastebin_username = 'username' # user corresponding with key
# --------------------------------------------- Value clean up --------------------------------------------------
pastebin_password = urllib.parse.quote(pastebin_password, safe='/')
pastebin_username = urllib.parse.quote(pastebin_username, safe='/')

# --------------------------------------------- Pastebin Functions --------------------------------------------------
def post(title, content): # used for posting a new paste
    pastebin_vars = {'api_option': 'paste', 'api_user_key': pastebin_username, 'api_paste_private': pastebin_private,
                     'api_paste_name': title, 'api_paste_expire_date': pastebin_postexp, 'api_dev_key': pastebin_key,
                     'api_user_password': pastebin_password, 'api_paste_code': content}
    try:
        str_to_paste = ', '.join("{!s}={!r}".format(key, val) for (key, val) in pastebin_vars.items()) # dict to str :D
        str_to_paste = str_to_paste.replace(":", "") # remove :
        str_to_paste = str_to_paste.replace("'", "") # remove '
        str_to_paste = str_to_paste.replace(")", "") # remove )
        str_to_paste = str_to_paste.replace(", ", "&") # replace dividers with &
        urllib.request.urlopen(pastebin_url, urllib.parse.urlencode(pastebin_vars)).read()
        print('did that work?')
    except:
        print("post submit failed :(")
        print(pastebin_url + "?" + str_to_paste) # print the output for test

post("test", "stuff")
I'm open to importing more libraries and stuff, not really sure what I'm doing wrong after working on this for two days straight :S

import urllib.parse
import urllib.request
PASTEBIN_KEY = 'xxx'
PASTEBIN_URL = 'https://pastebin.com/api/api_post.php'
PASTEBIN_LOGIN_URL = 'https://pastebin.com/api/api_login.php'
PASTEBIN_LOGIN = 'my_login_name'
PASTEBIN_PWD = 'yyy'

def pastebin_post(title, content):
    login_params = dict(
        api_dev_key=PASTEBIN_KEY,
        api_user_name=PASTEBIN_LOGIN,
        api_user_password=PASTEBIN_PWD
    )
    data = urllib.parse.urlencode(login_params).encode("utf-8")
    req = urllib.request.Request(PASTEBIN_LOGIN_URL, data)
    with urllib.request.urlopen(req) as response:
        pastebin_vars = dict(
            api_option='paste',
            api_dev_key=PASTEBIN_KEY,
            api_user_key=response.read(),
            api_paste_name=title,
            api_paste_code=content,
            api_paste_private=2,
        )
        return urllib.request.urlopen(PASTEBIN_URL, urllib.parse.urlencode(pastebin_vars).encode('utf8')).read()

rv = pastebin_post("This is my title", "These are the contents I'm posting")
print(rv)
Combining two of the other answers here gave me this working solution.

First, your try/except block is throwing away the actual error. You should almost never use a "bare" except clause without capturing or re-raising the original exception. See this article for a full explanation.
Once you remove the try/except, you will see the underlying error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "paste.py", line 42, in post
    urllib.request.urlopen(pastebin_url, urllib.parse.urlencode(pastebin_vars)).read()
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 461, in open
    req = meth(req)
  File "/usr/lib/python3.4/urllib/request.py", line 1112, in do_request_
    raise TypeError(msg)
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.
This means you're trying to pass a unicode string into a function that's expecting bytes. When you do I/O (like reading/writing files on disk, or sending/receiving data over HTTP) you typically need to encode any unicode strings as bytes. See this presentation for a good explanation of unicode vs. bytes and when you need to encode and decode.
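As a small illustration of the encode step, the sketch below builds a form body with urllib.parse.urlencode and converts it to bytes before it would be handed to urlopen. The parameter values are placeholders, and no request is actually sent.

```python
import urllib.parse

params = {"api_option": "paste", "api_paste_code": "hello world"}  # placeholder values

as_text = urllib.parse.urlencode(params)  # str: what the original code passed
as_bytes = as_text.encode("utf-8")        # bytes: what urlopen expects as POST data

print(type(as_text).__name__, as_text)
print(type(as_bytes).__name__, as_bytes)
```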
Next, this line:
urllib.request.urlopen(pastebin_url, urllib.parse.urlencode(pastebin_vars)).read()
is throwing away the response, so you have no way of knowing the result of your API call. Assign it to a variable or return it from your function so you can inspect the value. It will be either a URL to the paste or an error message from the API.
Next, I think your code is sending a lot of unnecessary parameters to the API and your str_to_paste statements aren't necessary.
I was able to make a paste using the following, much simpler, code:
import urllib.parse
import urllib.request

PASTEBIN_KEY = 'my-api-key' # developer api key, required. GET: http://pastebin.com/api
PASTEBIN_URL = 'http://pastebin.com/api/api_post.php'

def post(title, content): # used for posting a new paste
    pastebin_vars = dict(
        api_option='paste',
        api_dev_key=PASTEBIN_KEY,
        api_paste_name=title,
        api_paste_code=content,
    )
    return urllib.request.urlopen(PASTEBIN_URL, urllib.parse.urlencode(pastebin_vars).encode('utf8')).read()
Here it is in use:
>>> post("test", "hello\nworld.")
b'http://pastebin.com/v8jCkHDB'

I didn't know about pastebin until now. I read their api and tried it for the first time, and it worked perfectly fine.
Here's what I did:
I logged in to fetch the api_user_key.
Included that in the posting along with api_dev_key.
Checked the website, and the post was there.
Here's the code:
import urllib.parse
import urllib.request

def post(url, params):
    # URL-encode the parameters and send them as a bytes POST body.
    data = urllib.parse.urlencode(params).encode("utf-8")
    req = urllib.request.Request(url, data)
    with urllib.request.urlopen(req) as response:
        return response.read()

# Logging in to fetch api_user_key
login_url = "http://pastebin.com/api/api_login.php"
login_params = {"api_dev_key": "<the dev key they gave you>",
                "api_user_name": "<username goes here>",
                "api_user_password": "<password goes here>"}
api_user_key = post(login_url, login_params)

# Posting some random text
post_url = "http://pastebin.com/api/api_post.php"
post_params = {"api_dev_key": "<the dev key they gave you>",
               "api_option": "paste",
               "api_paste_code": "<head>Testing</head>",
               "api_paste_private": "0",
               "api_paste_name": "testing.html",
               "api_paste_expire_date": "10M",
               "api_paste_format": "html5",
               "api_user_key": api_user_key}
response = post(post_url, post_params)
Only the first three parameters are needed for posting something, the rest are optional.

FYI, the API doesn't seem to accept http requests as of this writing, so make sure your URLs are in the form https://pas...

YouTube API UnicodeEncodeError in Python 3.4

I was exploring the YouTube Data API and finding that improperly encoded results were holding me back. I got good results until I retrieve a set that includes unmapped characters in the titles. My code is NOW (cleaned up a little for you fine folks):
import urllib.request
import urllib.parse
import json
import datetime

# Look for videos published up to THIS MANY hours ago
IntHoursToSub = 2
RightNow = datetime.datetime.utcnow()
StartedAgo = datetime.timedelta(hours=-(IntHoursToSub))
HourAgo = RightNow + StartedAgo
HourAgo = str(HourAgo).replace(" ", "T")
HourAgo = HourAgo[:HourAgo.find(".")] + "Z"

# Get API Key from your safe place and set up the API link
YouTubeAPIKey = open('YouTubeAPIKey.txt', 'r').read()
locuURL = "https://www.googleapis.com/youtube/v3/search"
values = {"key": YouTubeAPIKey,
          "part": "snippet",
          "publishedAfter": HourAgo,
          "relevanceLanguage": "en",
          "regionCode": "us",
          "maxResults": "50",
          "type": "live"}
postData = urllib.parse.urlencode(values)
fullURL = locuURL + "?" + postData

# Set up response holder and handle exceptions
respData = ""
try:
    req = urllib.request.Request(fullURL)
    respData = urllib.request.urlopen(req).read().decode()
except Exception as e:
    print(str(e))
#print(respData)

# Read JSON response and iterate through for video names/URLs
jsonData = json.loads(respData)
for object in jsonData["items"]:
    if object["id"]["kind"] == "youtube#video":
        print(object["snippet"]["title"], "https://www.youtube.com/watch?v=" + object["id"]["videoId"])
The error was:
Traceback (most recent call last):
  File "C:/Users/Chad LaFarge/PycharmProjects/APIAccess/YouTubeAPI.py", line 33, in <module>
    print(respData)
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u25bb' in position 11737: character maps to <undefined>
UPDATE
MJY Called it! Starting from PyCharm menu bar:
File -> Settings... -> Editor -> File Encodings, then set: "IDE Encoding", "Project Encoding" and "Default encoding for properties files" ALL to UTF-8 and she now works like a charm.
Many thanks!

Check sys.stdout.encoding.
If it is not UTF-8, the problem is not in the YouTube API.
Check the PYTHONIOENCODING environment variable, as well as your terminal and locale settings.
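A quick way to run that check, and to see why the print fails, is sketched below. It assumes the console encoding is cp1252, as in the traceback; the sample title is made up, and the errors="replace" fallback is one possible workaround rather than something this answer prescribes.

```python
import sys

print(sys.stdout.encoding)  # e.g. 'cp1252' on an affected Windows console

title = "\u25bb live now"   # made-up title containing the character from the traceback

try:
    title.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Fallback: substitute unmappable characters instead of raising.
safe_bytes = title.encode("cp1252", errors="replace")
print(encodable, safe_bytes)
```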

Strange failure to make a HIT for Amazon Mechanical Turk with some URLs?

I was trying to include a link in a HIT request in Amazon Mechanical Turk, using boto, and kept getting an error that my XML was invalid. I gradually pared my html down to the bare minimum, and isolated that it seems to be that some valid links fail for seemingly no reason. Can anyone with expertise in boto or aws help me parse why?
I followed these two guides:
http://www.toforge.com/2011/04/boto-mturk-tutorial-create-hits/
https://gist.github.com/j2labs/740267
Here is my example:
from boto.mturk.connection import MTurkConnection
from boto.mturk.question import QuestionContent,Question,QuestionForm,Overview,AnswerSpecification,SelectionAnswer,FormattedContent,FreeTextAnswer
from config import *

HOST = 'mechanicalturk.sandbox.amazonaws.com'
mtc = MTurkConnection(aws_access_key_id=ACCESS_ID,
                      aws_secret_access_key=SECRET_KEY,
                      host=HOST)
title = 'HIT title'
description = ("HIT description.")
keywords = 'keywords'
s1 = """<![CDATA[<p>Here comes a link <a href='%s'>LINK</a></p>]]>""" % "http://www.example.com"
s2 = """<![CDATA[<p>Here comes a link <a href='%s'>LINK</a></p>]]>""" % "https://www.google.com/search?q=example&site=imghp&tbm=isch"

def makeahit(s):
    overview = Overview()
    overview.append_field('Title', 'HIT title itself')
    overview.append_field('FormattedContent',s)
    qc = QuestionContent()
    qc.append_field('Title','The title')
    fta = FreeTextAnswer()
    q = Question(identifier="URL",
                 content=qc,
                 answer_spec=AnswerSpecification(fta))
    question_form = QuestionForm()
    question_form.append(overview)
    question_form.append(q)
    mtc.create_hit(questions=question_form,
                   max_assignments=1,
                   title=title,
                   description=description,
                   keywords=keywords,
                   duration = 30,
                   reward=0.05)

makeahit(s1) # SUCCESS!
makeahit(s2) # FAIL?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 25, in makeahit
  File "/usr/local/lib/python2.7/dist-packages/boto/mturk/connection.py", line 263, in create_hit
    return self._process_request('CreateHIT', params, [('HIT', HIT)])
  File "/usr/local/lib/python2.7/dist-packages/boto/mturk/connection.py", line 821, in _process_request
    return self._process_response(response, marker_elems)
  File "/usr/local/lib/python2.7/dist-packages/boto/mturk/connection.py", line 836, in _process_response
    raise MTurkRequestError(response.status, response.reason, body)
boto.mturk.connection.MTurkRequestError: MTurkRequestError: 200 OK
<?xml version="1.0"?>
<CreateHITResponse><OperationRequest><RequestId>19548ab5-034b-49ec-86b2-9e499a3c9a79</RequestId></OperationRequest><HIT><Request><IsValid>False</IsValid><Errors><Error><Code>AWS.MechanicalTurk.XHTMLParseError</Code><Message>There was an error parsing the XHTML data in your request. Please make sure the data is well-formed and validates against the appropriate schema. Details: The reference to entity "site" must end with the ';' delimiter. Invalid content: <FormattedContent><![CDATA[<p>Here comes a link <a href='https://www.google.com/search?q=example&site=imghp&tbm=isch'>LINK</a></p>]]></FormattedContent> (1369323038698 s)</Message></Error></Errors></Request></HIT></CreateHITResponse>
Any idea why s2 fails, but s1 succeeds when both are valid links? Both link contents work:
http://www.example.com
https://www.google.com/search?q=example&site=imghp&tbm=isch
Things with query strings? Https?
UPDATE
I'm going to do some tests, but right now my candidate hypotheses are:
1. HTTPS doesn't work (so, I'll see if I can get another https link to work)
2. URLs with params don't work (so, I'll see if I can get another url with params to work)
3. Google doesn't allow its searches to get posted this way? (if 1 and 2 fail!)

You need to escape ampersands in URLs, i.e. & => &amp;.
At the end of s2, use
q=example&amp;site=imghp&amp;tbm=isch
instead of
q=example&site=imghp&tbm=isch
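If the URL is built in Python, the escaping can be done programmatically rather than by hand. A minimal sketch using xml.sax.saxutils.escape from the standard library, applied to the URL from the question before it is interpolated into the markup (no HIT is created here):

```python
from xml.sax.saxutils import escape

url = "https://www.google.com/search?q=example&site=imghp&tbm=isch"
escaped_url = escape(url)  # replaces &, < and > with XML entities

s2 = "<![CDATA[<p>Here comes a link <a href='%s'>LINK</a></p>]]>" % escaped_url
print(escaped_url)
```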

In python, what is a character buffer?

I am a beginner SQLite user and have run into some trouble; I'm hoping someone here can help.
I am trying to read some data out of a database, put it into a variable in Python and print it out onto an HTML page.
The table inside the database is called "Status"; it contains two columns, "stamp" and "messages". "stamp" is an INT containing a time stamp, and "messages" contains TEXT.
@cherrypy.expose
def comment(self, ID = None):
    con = lite.connect('static/database/Status.db')
    output = ""
    with con:
        cur = con.cursor()
        cur.execute("SELECT * FROM Status WHERE stamp = ?", (ID,))
        temp = cur.fetchone()
        output = temp[0]
    comments = self.readComments(ID)
    page = get_file(staticfolder+"/html/commentPage.html")
    page = page.replace("$Status", output)
The error I am getting reads:
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/cherrypy/_cprequest.py", line 606, in respond
    cherrypy.response.body = self.handler()
  File "/usr/lib/pymodules/python2.7/cherrypy/_cpdispatch.py", line 25, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "proj1base.py", line 184, in comment
    page = page.replace("$Status", output)
TypeError: expected a character buffer object
Could anyone clarify what a character buffer object is, and how I can use one to make my code work?

Replace "character buffer" by "string" for starters. (There are more types exposing the "buffer protocol" in Python, but don't bother with them for now.) Most likely, output ended up not being a string. Log its type in the line before the error.
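Following that suggestion, the sketch below reproduces the situation with an in-memory SQLite table shaped like the one described (an INT stamp column followed by a TEXT messages column; the row values and the page template are made up). With SELECT *, index 0 is the integer stamp, which str.replace rejects; the exact TypeError wording differs between Python 2 and Python 3.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Status (stamp INT, messages TEXT)")
con.execute("INSERT INTO Status VALUES (?, ?)", (1369323038, "hello"))  # made-up row

temp = con.execute("SELECT * FROM Status WHERE stamp = ?", (1369323038,)).fetchone()
print(type(temp[0]).__name__, type(temp[1]).__name__)  # log the types, as suggested

page = "<p>$Status</p>"  # made-up template
try:
    page.replace("$Status", temp[0])  # temp[0] is the integer stamp
    replaced_with_int = True
except TypeError:
    replaced_with_int = False

fixed = page.replace("$Status", temp[1])  # the message text is a string
print(replaced_with_int, fixed)
```

If the stamp itself is what should appear on the page, wrapping it as str(temp[0]) would be the alternative.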
