Scrapy - use multiple IP Addresses for a host - python

Wasn't able to find anything in the docs/SO relating to my question.
So basically I'm crawling a website with 8 or so subdomains, all served through Akamai/CDN.
My question: if I can find the IPs of a few different Akamai data centres, can I somehow explicitly say that this subdomain should use this IP for the hostname, etc.? So basically, override the automatic DNS resolution.
This would allow greater efficiency, and I imagine I'd be less likely to be throttled, as I'd be distributing the crawling.
Thanks

You can just set your DNS names manually in your hosts file. On Windows this file is at C:\Windows\System32\Drivers\etc\hosts, and on Linux at /etc/hosts.
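For example, if you had resolved a few CDN edge IPs for your subdomains (all addresses and names below are made up for illustration), the entries would look like:

# /etc/hosts (or C:\Windows\System32\Drivers\etc\hosts on Windows)
23.0.0.10    sub1.example.com
23.0.0.20    sub2.example.com
23.0.0.30    sub3.example.com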

Related

I want to change my ip address without using vpn or proxy

I'm scraping some pages, and these pages check whether my IP belongs to a VPN or proxy (a fake IP); if it is found to be fake, the site blocks my request. Is there a way to change my IP every x amount of time to a real IP, without using a VPN or proxy and without restarting the router?
Note: I am using a Python script for this process
Your IP address is fixed by your internet service provider; if you reset your home router, you can sometimes get another IP address, depending on various internal factors.
Some websites block by the User-Agent, by the IP geolocation of your request, or by rate limit. But if you're sure it's by IP, then the only way to swap your IP address is through VPN tunneling or a proxy service such as ProxyMesh.
You can obtain free proxy addresses from https://www.freeproxylists.net/ . Since these are free proxies, they may go down quickly, so you might need to rotate the IP with each request you make to your target address.
To set a proxy address, follow this question: Proxies with Python 'Requests' module.
So the flow would be:
Scrape the proxies from the address above.
Then add the proxy settings as described in the other question.
Rotate the IP with each request to the target (a minimal sketch follows this list).
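A minimal sketch of that flow, assuming the requests library and a made-up proxy list (the addresses below are placeholders, not live proxies):

import random
import requests

# Placeholder proxies; in practice, scrape fresh ones from the list above.
proxies = ["103.152.112.162:80", "45.77.56.114:3128", "91.107.6.115:53281"]

def fetch(url):
    # Try a random proxy per request; drop dead ones and retry.
    while proxies:
        proxy = random.choice(proxies)
        try:
            return requests.get(
                url,
                proxies={"http": "http://" + proxy, "https": "http://" + proxy},
                timeout=10,
            )
        except requests.RequestException:
            proxies.remove(proxy)  # proxy is down; rotate to the next one
    raise RuntimeError("no working proxies left")

print(fetch("http://httpbin.org/ip").text)  # should show the proxy's IP, not yours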
There are other blocking factors besides your IP:
Browser agent (https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/?sfw=pass1637120088; see the sketch after this list).
Too-rigorous scraping (try to randomize the timing between two requests).
Not following the robots.txt file (this sometimes can't be avoided).
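A rough sketch covering the first two points (the User-Agent strings are just examples to rotate through):

import random
import time
import requests

# Example desktop User-Agent strings; pick one at random per request.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def polite_get(url):
    time.sleep(random.uniform(1.0, 4.0))  # randomized gap between requests
    return requests.get(url, headers={"User-Agent": random.choice(USER_AGENTS)})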

Get gTLD or ccTLD from IP address

There are many questions on SO about fetching an IP address from a URL, but not vice versa.
As the title suggests, I would like to get a website's domain from its IP address. For instance:
>>> import socket
>>> print(socket.gethostbyname('google.com'))
This looks up the domain and returns 172.217.20.14. I am looking for the counterpart, e.g.:
>>> print(socket.getnamebyhost('172.217.20.14'))
Anything similar that would return the domain google.com for the IP specified.
Is this possible to do in Python 3?
If yes, how can this be achieved?
UPDATE
Unfortunately, the way I'm approaching this is wrong. Some IPs have a one-to-many relationship with domains, i.e. the name server points to numerous URLs, unless a PTR record indicates otherwise. My question, rephrased:
How do IP-to-domain data providers like ipinfo.io return top-level domains for a single IP?
To my understanding, the A or AAAA records play an important role, but the only thing I get from these is the NS rather than the domain. I don't know how to extract the gTLD or ccTLD from the records. I'm open to any suggestions if anyone is willing to share an answer on how to parse gTLDs or ccTLDs from any IP. Preferably in Python, but a shell script would also suffice.
socket.gethostbyaddr('172.217.20.14') would be the right way to go here, but it won't necessarily work. Here's why:
Domain to IP resolution goes like:
domain > root server > origin server > origin server's hostname to IP configurations.
Now, to reverse-engineer it, we have to take into account:
There can be multiple domains sharing the same IP address, as is the case with shared hosting.
Assuming the domain has a dedicated IP, nslookup or gethostbyaddr 'should' return the domain name, but there can be proxy servers in front, like Cloudflare or whatever Google is using.
So even if you try this manually and attempt to find the actual IP Google's server is running on, you cannot, as that would expose their central servers to all kinds of attacks, most importantly DDoS.
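For what it's worth, the reverse lookup itself is one call. Note that for this Google IP the PTR record typically resolves to a *.1e100.net host rather than google.com, which illustrates the point above:

import socket

try:
    # Returns (hostname, aliaslist, ipaddrlist) if a PTR record exists.
    hostname, aliases, ips = socket.gethostbyaddr("172.217.20.14")
    print(hostname)  # e.g. a *.1e100.net name, not google.com
except socket.herror:
    print("no PTR record for this address")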

urllib2 - get resource if you already know the IP

In my Python script I am fetching pages, but I already know the IP of the server.
So I could save myself the hassle of a DNS lookup if I could somehow pass both the IP and the hostname in the request.
So, if I call
http://111.111.111.111/
and then pass the hostname in the Host header, I should be OK. However, the issue I see is on the server side: if the user looks at the incoming request (i.e. REQUEST_URI), they will see that I went for the IP.
Anyone have any ideas?
First, the main idea is suspect. You can "know" the IP of the server, but that knowledge is temporary, and how long it stays correct is controlled by DNS TTLs. For a stable configuration, the server admin can publish a DNS record with a long TTL (e.g. a few days), so DNS requests will always be fulfilled by the nearest caching resolver or nscd. For a changing configuration, the TTL can be reduced to a few seconds or even to 0 (meaning no caching), which can be useful for some kinds of load balancers. You are trying to build your own resolver cache that ignores TTLs, and this can lead to requests going to non-functioning or wrong servers with incorrect contents. So, I suggest not doing this.
If you are sure you must do this and can't use external tools such as a custom resolver or even /etc/hosts, try installing a custom "opener" (see the urllib2.build_opener() function in the documentation) which overrides the DNS lookup. However, I have never done this myself; the knowledge comes only from reading the documentation just now.
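For the simple case, you may not need a custom opener at all: you can request the IP directly and supply the Host header yourself. A sketch using Python 3's urllib.request (urllib2 in Python 2 works the same way), with placeholder values; note this only suits plain HTTP, since HTTPS certificate checks and SNI would fail against a bare IP:

import urllib.request  # urllib2 in Python 2

ip = "111.111.111.111"        # the server IP you already know (placeholder)
hostname = "www.example.com"  # the name the virtual host expects (placeholder)

# Connect by IP, but send the real name in the Host header, so a
# name-based virtual host still serves the right site. The request
# line the server sees is "GET / HTTP/1.1", not the IP.
req = urllib.request.Request("http://" + ip + "/", headers={"Host": hostname})
print(urllib.request.urlopen(req, timeout=10).read()[:200])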
You can add the IP-to-hostname mapping to the hosts file.

Http connect request from multiple IP address to destination in python

conn = httplib.HTTPConnection(self.proxy)
self.proxy holds the destination IP and port.
I want to make multiple connections from multiple IP addresses to the destination.
How do I specify the source IP when making the connect request? Please help me out.
Thanks in advance.
I assume that you have multiple network connections on the same computer (e.g. a wired and a wireless connection) and you want to make sure that your connection goes over a specific interface.
In general you cannot do this. How your traffic is sent to a specific IP address, and therefore what source IP address it shows, is determined by your operating system's routing tables. As you haven't specified what operating system this is, I can't go into more detail.
You may be able to do this using some of the more advanced routing configuration, but that's an operating system level problem and can't be done through Python.
I found a solution, but not a 100% one.
Requirement: send requests from 10 IP addresses to one destination.
I achieved this through the following API:
class httplib.HTTPConnection(host[, port[, strict[, timeout[, source_address]]]])
Here, the last parameter, source_address, takes the source IP; note that it must be a (host, port) tuple, not a bare string.
Like: httplib.HTTPConnection(dest_ip, dest_port, source_address=(src_ip, 0))
For example: httplib.HTTPConnection("198.168.1.5", 8080, source_address=("198.168.1.1", 0))
I created the connections in a for loop over 10 unique source IP addresses.
Output of my first attempt, which passed the source IP as the third positional argument: connected to the destination from 10 different port numbers with the same IP address. That happens because the third positional parameter is strict, not source_address, so the source IP never took effect and the OS picked the default interface with a fresh ephemeral port each time.
Problem solved. Thanks for all.
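For reference, a minimal sketch of the corrected approach (http.client in Python 3, httplib in Python 2). The addresses are placeholders, and each source IP must actually be configured on a local interface, or connect() fails with "cannot assign requested address":

from http.client import HTTPConnection  # httplib in Python 2

DEST_IP, DEST_PORT = "198.168.1.5", 8080
SRC_IPS = ["198.168.1.%d" % i for i in range(1, 11)]  # ten local addresses

for src in SRC_IPS:
    # source_address must be a (host, port) tuple; port 0 = any free port.
    conn = HTTPConnection(DEST_IP, DEST_PORT, source_address=(src, 0))
    conn.request("GET", "/")
    print(src, conn.getresponse().status)
    conn.close()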

Alternate host/IP for python script

I want my Python script to access a URL through an IP specified in the script instead of through the default DNS for the domain. Basically I want the equivalent of adding an entry to my /etc/hosts file, but I want the change to apply only to my script instead of globally on the whole server. Any ideas?
Whether this works or not will depend on whether the far-end site is using HTTP/1.1 name-based virtual hosting.
If it isn't, you can simply replace the hostname part of the URL with the IP address, per @Greg's answer below.
If it is, however, you have to ensure that the correct Host: header is sent as part of the HTTP request. Without that, a name-based virtual hosting web server won't know which site's content to give you. Refer to your HTTP client API (curl?) to see if you can add or change the default request headers.
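One way to get a per-script hosts-file effect, while keeping the Host: header correct automatically, is to override name resolution inside Python itself. A sketch (the hostname and IP below are placeholders) that monkey-patches socket.getaddrinfo, which the standard HTTP clients use for lookups:

import socket

# Per-script "hosts file": hostname -> IP overrides (placeholder values).
HOST_OVERRIDES = {"www.example.com": "93.184.216.34"}

_orig_getaddrinfo = socket.getaddrinfo

def patched_getaddrinfo(host, *args, **kwargs):
    # Redirect overridden names; everything else resolves normally.
    return _orig_getaddrinfo(HOST_OVERRIDES.get(host, host), *args, **kwargs)

socket.getaddrinfo = patched_getaddrinfo

Because the URL still contains the hostname, the Host: header is sent as usual; only the address the socket connects to changes, and nothing outside this process is affected.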
You can use an explicit IP address to connect to a specific machine by embedding it in the URL: http://127.0.0.1/index.html is equivalent to http://localhost/index.html.
That said, it isn't a good idea to use IP addresses instead of DNS names. IPs change a lot more often than DNS entries, meaning your script has a greater chance of breaking if you hard-code the address instead of letting it resolve normally.
