IP address revealed behind Tor - python

I am routing my crawler's requests through Tor. The crawler is multithreaded but puts very little load on the server, since each thread sleeps for a normally distributed random time with a mean of 20 seconds (approx. 3 requests a minute). I need to get the first Google search result for some 20,000-odd queries. The crawler is written in Python using urllib2 (SOCKS proxy) and mechanize (HTTP proxy).
# Snippet of code initializing the urllib2 build_opener
host = socks_hostname
port = socks_port
socks_username = username
socks_password = password
cj = cookielib.CookieJar()
br = urllib2.build_opener(SocksiPyHandler(socks.PROXY_TYPE_SOCKS5, host, port,
                                          username=socks_username,
                                          password=socks_password),
                          urllib2.HTTPCookieProcessor(cj))
# Get randomly generated User-Agent string.
br.addheaders = [('User-Agent', self.get_user_agent())]
return br
I just discovered that the Tor network isn't hiding my IP as far as Google is concerned. I wrote a small test script to check my IP address as seen by Google and by http://whatismyip.net. While whatismyip.net reports an IP based in Canada, Google shows my real IP, which confuses me. I have made sure that I don't have any cookies that could be used to track me.
Even more puzzling, when I use Tor in Firefox, Google also shows a random IP based in Canada. So my real IP is exposed only when I send automated requests. Can someone help me figure out what is causing this leak?
I understand crawling is a sensitive topic, but my crawl rate is actually slower than a human being's!

Related

I want to change my IP address without using a VPN or proxy

I am scraping some pages, and these pages check whether my IP belongs to a VPN or proxy (a fake IP). If it does, the site blocks my request. Is there a way to change my IP every so often to another real IP, without using a VPN or proxy and without restarting my router?
Note: I am using a Python script for this process.
Your IP address is assigned by your internet service provider. If you reset your home router, you can sometimes get a different IP address, depending on factors on the provider's side.
Some websites block by User-Agent, by the IP geolocation of your request, or by rate limit. But if you are sure the block is by IP, the only way to change your IP address is through a VPN tunnel or a proxy service such as ProxyMesh.
You can obtain free proxy addresses from https://www.freeproxylists.net/ . Since these proxies are free, they may go down quickly, so you might sometimes need to rotate the IP with each request you make to your target address.
To set a proxy address, see the question "Proxies with Python 'Requests' module".
So the flow would be:
Scrape the proxies from the above address first.
Then set the proxy as described in the other question.
Rotate the IP with each new request to the target.
Your IP is not the only factor that can get you blocked. Others include:
The browser User-Agent (https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/?sfw=pass1637120088).
Scraping too aggressively (try to randomize the time between two requests).
Not following the robots.txt file (this sometimes can't be avoided).
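The flow above could be sketched roughly as follows. This is only an illustration using the standard library's urllib.request. The helper names (proxy_mapping, rotating_openers, polite_delay, crawl) are made up for this sketch, and the proxy addresses you pass in are assumed to have been scraped beforehand as "host:port" strings.

```python
import itertools
import random
import time
import urllib.request


def proxy_mapping(address):
    """Turn "host:port" into the scheme -> proxy URL mapping.

    The same dict shape is accepted by urllib.request.ProxyHandler and by
    the `proxies=` argument of the Requests library.
    """
    proxy_url = "http://" + address
    return {"http": proxy_url, "https": proxy_url}


def rotating_openers(addresses):
    """Yield (address, opener) pairs forever, cycling through the proxies."""
    for address in itertools.cycle(addresses):
        handler = urllib.request.ProxyHandler(proxy_mapping(address))
        yield address, urllib.request.build_opener(handler)


def polite_delay(mean=20.0, spread=5.0, minimum=1.0):
    """Random pause in seconds, normally distributed, never below `minimum`."""
    return max(minimum, random.gauss(mean, spread))


def crawl(urls, addresses):
    """Fetch each URL through the next proxy; body is None if the fetch fails."""
    openers = rotating_openers(addresses)
    for url in urls:
        address, opener = next(openers)
        try:
            with opener.open(url, timeout=30) as response:
                body = response.read()
        except OSError:  # covers URLError, timeouts and refused connections
            body = None
        yield url, address, body
        time.sleep(polite_delay())  # randomized gap between two requests
```

It would be used along the lines of `for url, proxy, body in crawl(my_urls, my_proxies): ...`, where both lists are yours to supply. A failed fetch simply yields None here; a real crawler would probably want to drop dead proxies and retry.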

How to fix "No connection could be made because the target machine actively refused it" through a firewall

I am writing a basic Python program with the end goal of scraping data from websites for data processing/streamlining. It works fine when I'm not connected to the company network, but it does not work when I am.
I have connected to an "alternative" network at work, with fewer restrictions, and it works fine. However, this is not a long-term solution, as connecting to this network means I do not have access to my files and email, which I need.
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.google.com')
soup = BeautifulSoup(page.content,'html.parser')
print(soup.prettify())
As others said, if you are able to load Google in your browser while connected to the company network, it is most likely because your company uses a proxy.
The Requests module supports proxies, so you can use one.
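A minimal sketch of what that might look like, assuming you can find out the proxy host and port from your browser settings or your IT department. The function names and any host or port you pass in are placeholders for this example, not values from the question.

```python
import urllib.request


def explicit_proxies(host, port):
    """Build the mapping that requests expects in its `proxies=` argument."""
    proxy_url = "http://{}:{}".format(host, port)
    return {"http": proxy_url, "https": proxy_url}


def detect_system_proxies():
    """Return whatever proxy settings the OS or environment advertises.

    urllib.request.getproxies() reads variables such as HTTP_PROXY and
    HTTPS_PROXY, and the system proxy settings on Windows and macOS.
    The result is an empty dict when nothing is configured.
    """
    return urllib.request.getproxies()


def fetch(url, proxies):
    """GET `url` through the given proxies using the Requests library."""
    import requests  # third-party; imported here so the helpers above work without it
    return requests.get(url, proxies=proxies, timeout=30)
```

For instance, `fetch('http://www.google.com', explicit_proxies(host, port))` with your company's values. Requests also honours the HTTP_PROXY and HTTPS_PROXY environment variables on its own, so setting those may be enough. If the proxy requires authentication, that is a separate step not covered by this sketch.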

Access web site by location in python

Some websites have a prefix such as pt. or en. at the beginning, or a suffix such as .br or .it at the end, depending on the server location.
When I use a Python library function such as urlopen, I have to pass the full address of the website, including the suffix for the server location (for international servers).
Some international websites have a separate service for each country. Is there some way for Python to make this transparent to the user (by adding the suffix or prefix)? Some web pages do not automatically redirect to the nearest local server.
If you try to access google.com and google decides to forward you
automatically to google.se (for example), there's nothing the client
can do about it - whether that client is a human or a python script.
That is controlled by the webserver, not the client.
What Danielle said in the comment is not entirely correct. When the client accesses "google.com", the host notices your IP location and sends back a response telling the browser to redirect to "google.se" (to go with Danielle's example) so that the site matches your IP location. However, you can avoid redirects. For the sake of the question, here's a simple demonstration using the Python Requests library, setting allow_redirects to False.
import requests
r = requests.get('https://www.google.com')
print(r.url)
# 'https://www.google.ca/?gfe_rd=cr&dcr=0&ei=mpewWZGdGePs8we597n4Dw'
# requests automatically followed the redirect link to google.ca
r = requests.get('https://www.google.com', allow_redirects=False)
print(r.url)
# 'https://www.google.com/'
# here it stays at google.com
Your question isn't clear enough to provide a more thorough answer. But I hope my example has helped you a bit.

Python script through the Tor network

I have written a simple python script that fetches my ip.
import urllib
import socks
import socket
#set the proxy and port
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9150)
#initialize the socket
socket.socket = socks.socksocket
#store the URL that we want
url = 'https://check.torproject.org/'
#open the URL and store it into 'response'
response = urllib.urlopen(url)
#parse the response
html = response.read()
#print to console
print html
Nothing too complex. However, the problem starts when analyzing the response from check.torproject.org. The site always gives me an address that is different from the one shown in my currently running Tor Browser on the same page. The HTML response says that I am being routed through the Tor network, but that the request doesn't look like it is coming from the 'standard' Tor Browser. The latter part I understand: though I did not include it in the code above, I was playing with User-Agent strings and other headers, so I will chalk it up to that being the primary cause. What I do not understand is where in the h-e-double hockey sticks the IP served in the response to the py script came from.
My next question, which builds on top of all this, is how do I connect my python script to the tor network correctly? After a little googling, I found that tor will block traffic for everything other than the socks protocol and that an alternative is to use privoxy in conjunction with tor. My initial thought is to do some kind of routing that would result in the layering of software. In my mind, it would look like:
Python -> Privoxy -> Tor -> Destination
My end goal in all of this is to grab a .onion based address and save/read it. However, I have put that to the side after all of these problems started occurring. A little info to help get better answers: I am using a Windows machine, though I have a Linux one if there is some functionality that may be present there that would help this process, and I am using Python 2.7 though, again, this can be easily changed.
I would like to ask that the steps to make all this happen be laid out, or at least some links/direction; I am by no means afraid to read a few good blogs/tutorials about the subject. However, I feel like this is really a couple of separate questions and would require quite a lengthy answer, so I would be more than happy to just know that I am on the right path before I rip more of my hair out :)
Your code is correct, however your assumption that Tor will always give you the same IP address is not. Thanks to circuit isolation, a privacy feature of Tor that ensures isolation between the connections you open, you're routing the request through a different exit node than the Tor Browser will.
Reliably emulating the Tor Browser behavior is hard and I would recommend against it. Your method for connecting to the Tor network looks correct.
Tor will allow you to use any protocol you want, but yes you need to connect through the SOCKS protocol. That's fine though: almost all network protocols (http included) play nicely with SOCKS.
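As an alternative to patching socket.socket globally, the SOCKS proxy can be passed per request. The sketch below assumes the Requests library with SOCKS support installed (`pip install requests[socks]`); the function names are made up for this example. Port 9150 is the Tor Browser port used in the question; a standalone tor service usually listens on 9050 instead.

```python
def tor_proxies(host="127.0.0.1", port=9150):
    """Proxy mapping that sends both HTTP and HTTPS through Tor's SOCKS port.

    The "socks5h" scheme asks the proxy to resolve host names, so DNS
    lookups also go through Tor. That is what makes .onion addresses
    reachable; plain "socks5" would resolve names locally.
    """
    proxy_url = "socks5h://{}:{}".format(host, port)
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_tor(url, port=9150):
    """GET `url` through the local Tor SOCKS proxy."""
    import requests  # third-party; needs the "socks" extra (PySocks)
    return requests.get(url, proxies=tor_proxies(port=port), timeout=60)
```

Calling `fetch_via_tor('https://check.torproject.org/').text` should then return the same kind of page as the script in the question, provided Tor is running on that port.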
With the torpy library you can renew circuits as you wish.
>>> from torpy.http.requests import TorRequests
>>>
>>> def show_ip(resp):
...     for line in resp.text.splitlines():
...         if 'Your IP address appears to be' in line:
...             print(line)
...
>>> with TorRequests() as tor_requests:
...     print("build circuit")
...     with tor_requests.get_session() as sess:
...         show_ip(sess.get("https://check.torproject.org/"))
...         show_ip(sess.get("https://check.torproject.org/"))
...     print("renew circuit")
...     with tor_requests.get_session() as sess:
...         show_ip(sess.get("https://check.torproject.org/"))
...         show_ip(sess.get("https://check.torproject.org/"))
...
build circuit
<p>Your IP address appears to be: <strong>178.17.171.102</strong></p>
<p>Your IP address appears to be: <strong>178.17.171.102</strong></p>
renew circuit
<p>Your IP address appears to be: <strong>49.50.66.209</strong></p>
<p>Your IP address appears to be: <strong>49.50.66.209</strong></p>

Python urllib2 anonymity through tor

I have been trying to use SocksiPy (http://socksipy.sourceforge.net/) and set my sockets with SOCKS5 and set it to go through a local tor service that I am running on my box.
I have the following:
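import socket
import socks
import urllib2
# Route every new socket through the local Tor SOCKS5 proxy; the final True enables remote DNS (rdns)
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "localhost", 9050, True)
socket.socket = socks.socksocket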
And I am doing something similar to:
workItem = "http://192.168.1.1/some/stuff" #obviously not the real url
req = urllib2.Request(workItem)
req.add_header('User-agent', 'Mozilla 5.10')
res = urllib2.urlopen(req, timeout=60)
And even using this I have been identified by the website. My understanding was that I would be coming out of a random exit node every time and that the site wouldn't be able to identify me. I can confirm that if I hit whatsmyip.org this way, my exit IP is different every time. Are there other steps I have to take to stay anonymous? I am using an IP address in the URL, so it shouldn't be doing any DNS resolution that might give it away.
There is no such User-Agent as 'Mozilla 5.10' in reality. If the server employs even the simplest fingerprinting based on the User-Agent, it will identify you based on this uncommon setting.
And I don't think you understand Tor: it does not provide full anonymity. It only helps by hiding your real IP address. It does not help if you give your real name on a website or use easily detectable features such as an uncommon user agent.
You might have a look at the Design and Implementation Notes for the Tor Browser bundle to see what kind of additional steps they take to be less detectable and where they still see open problems. You might also read about device fingerprinting, which is used to identify the seemingly anonymous peer.
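To illustrate just the User-Agent point, a request can be built with headers that look like those of a common browser. This is a sketch written for Python 3's urllib.request rather than the urllib2 code in the question, and the header values are illustrative assumptions only: copy real ones from a current browser. It does nothing about the other fingerprinting techniques mentioned above.

```python
import urllib.request

# Illustrative values only (an assumption of this sketch): take the real
# strings from a browser you actually have installed.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Create a Request carrying browser-like headers instead of 'Mozilla 5.10'."""
    return urllib.request.Request(url, headers=dict(headers))
```

The resulting object can be passed to `urllib.request.urlopen(req, timeout=60)` in place of a bare URL.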
