Access Denied 403 when web scraping; what to do? - python

I was testing a scraping algorithm that I had built. I made a request to https://www2.hm.com/fi_fi/miesten.html but misspecified the user-agent information. It seems that this triggered an immediate ban (I am not sure). Scraping their site should be fine; their robots.txt says:
User-agent: *
Disallow:
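You can confirm this with the standard library's robots.txt parser (a minimal sketch):
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www2.hm.com/robots.txt")
rp.read()  # fetch and parse robots.txt
print(rp.can_fetch("*", "https://www2.hm.com/fi_fi/miesten.html"))  # True if allowed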
(Example of making a request to HM and the subsequent server response.)
I erased the user agent and proxy information due to privacy concerns. However, they are nothing out of the ordinary.
I receive the following as response:
"b'\nAccess Denied\n\n\n \nYou don't have permission to access "http://www2.hm.com/fi_fi/miesten.html" on this server.\nReference #18.2796ef50.1625728417.f9aab80\n\n\n'"
So my question is: is there anything that I can do to lift this ban? Can I contact someone on their end and ask them to lift it? If so, where can this information usually be found?
Although this question concerns this site in particular, it is also a much broader question: in the case of a ban, can the user contact someone behind the server? I thought about contacting customer support, but I heavily suspect that they cannot help with this issue and won't even understand what it is about.
I have googled this issue but found nothing of help. The usual advice is to clear the cache, memory, etc., but that is not the problem here. I can access the site via Chrome and other browsers, but when using requests via Python, this problem appears.

Pretty sure you need to use a JavaScript-capable scraping bot; you can try this tool: https://docs.python-requests.org/projects/requests-html/en/latest/
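For example, a minimal sketch with requests-html; the User-Agent value here is just an illustrative browser-like string, not a recommendation:
from requests_html import HTMLSession

session = HTMLSession()
resp = session.get(
    "https://www2.hm.com/fi_fi/miesten.html",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
resp.html.render()  # runs the page's JavaScript in a headless Chromium
print(resp.status_code)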
And to get contact information for the owner of a website, you can use the Unix whois command:
whois hm.com

Related

Selenium - ERR_TOO_MANY_REDIRECTS [duplicate]

I am trying to automate my login to a webpage to download a daily XML file. I understand that I need the actual frame URL, which I think is:
http://shop.braintrust.gr/shop/store/customerauthenticateform.asp
I examine the form and the fields and I do the following
from selenium import webdriver

# email and pwd are defined elsewhere in my script
browser = webdriver.Chrome('C:\\chromedriver.exe')
browser.get('http://shop.braintrust.gr/shop/store/customerauthenticateform.asp')
print('Browser Opened')
# locate the form fields by their name attributes and fill them in
username = browser.find_element_by_name('UserID')
username.send_keys(email)
password = browser.find_element_by_name('password')
# time.sleep(2)
password.send_keys(pwd)
but I get a blank page saying that the browser performed too many redirections. Does this mean that it is impossible to log in?
How can I log in?
Thank you.
ERR_TOO_MANY_REDIRECTS
ERR_TOO_MANY_REDIRECTS (also known as a redirect loop) is one of the common website errors. Typically this error occurs after a recent change to your website, a misconfiguration of redirects on your server, or wrong settings with third-party services.
This error has no relation to Selenium as such and can be reproduced through manual steps.
The reason for ERR_TOO_MANY_REDIRECTS is that something is causing your website to go into an infinite redirection loop. Essentially the site is stuck (for example, URL 1 points to URL 2 and URL 2 points back to URL 1, or the domain has redirected you too many times), and unlike some other errors, these loops rarely resolve themselves and will probably need you to take action to fix them. There are a couple of different variations of this error depending on the browser you're running.
Solution
Some common approaches to check and fix the error are as follows:
Delete Cookies on That Specific Site: Google and Mozilla both in fact recommend, right below the error, trying to clear your cookies. Cookies can sometimes contain faulty data that causes the ERR_TOO_MANY_REDIRECTS error. This is one recommendation you can try even if you're encountering the error on a site you don't own. Because cookies retain your logged-in status on sites and other settings, you can simply delete the cookie(s) for the site that is having the problem. This way you won't impact any of your other sessions or websites that you frequently visit.
Clear Browser Cache: If you want to check whether it might be your browser cache without actually clearing it, you can always open your browser in incognito mode, or test another browser and see if you still get the ERR_TOO_MANY_REDIRECTS error.
Determine the Nature of the Redirect Loop: If clearing the cache didn't work, you'll want to see if you can determine the nature of the redirect loop; for example, a site may have a 301 redirect back to itself, causing a long chain of faulty redirects. You can follow all the redirects and determine whether the chain loops back on itself or is perhaps an HTTP-to-HTTPS loop (see the sketch after this list).
Check Your HTTPS Settings: Another thing to check is your HTTPS settings. ERR_TOO_MANY_REDIRECTS is often observed when someone has just migrated their WordPress site to HTTPS and either didn't finish the migration or set something up incorrectly.
Check Third-Party Services: ERR_TOO_MANY_REDIRECTS is also commonly caused by reverse-proxy services such as Cloudflare. This usually happens when their Flexible SSL option is enabled while you already have an SSL certificate installed with your WordPress host: with Flexible SSL selected, all requests to your hosting server are sent over HTTP, your host server most likely already redirects HTTP to HTTPS, and so a redirect loop occurs.
Check Redirects on Your Server: Besides HTTP-to-HTTPS redirects on your server, it is worth checking that there aren't any additional redirects set up incorrectly. For example, one bad 301 redirect back to itself could take down your site. Usually these are found in your server's configuration files.
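To help determine the nature of the loop, you can walk the redirect chain one hop at a time; a minimal sketch with requests, using the URL from the question:
import requests
from urllib.parse import urljoin

url = "http://shop.braintrust.gr/shop/store/customerauthenticateform.asp"
for _ in range(10):  # arbitrary hop limit so we never loop forever
    resp = requests.get(url, allow_redirects=False)
    print(resp.status_code, url)
    if resp.status_code not in (301, 302, 303, 307, 308):
        break  # not a redirect: the chain ends here
    url = urljoin(url, resp.headers["Location"])  # handle relative Location headers
If the same URLs keep repeating in the output, you have found your loop.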

get icloud web service endpoints to fetch data

My question may look silly, but I am asking it after a lot of searching on Google that has not given me any clue.
I am using iCloud web services. For that I have converted this Python code to PHP. https://github.com/picklepete/pyicloud
Up to this point, everything is working fine. When I authenticate using my iCloud username and password, I get a list of web service URLs as part of the response. Now, for example, to use the Contacts web service, I need to take the Contacts web service URL and add a path to it to fetch contacts.
https://p45-contactsws.icloud.com:443/co/startup with some parameters.
The web service URL https://p45-contactsws.icloud.com:443 comes in the response while authenticating, but the latter part, 'co/startup', is only found in the Python code, and I don't know how they discovered it. The services that appear in the Python code work fine, but I want to use a few other services, like https://p45-settingsws.icloud.com:443, https://p45-keyvalueservice.icloud.com:443, etc., and when I send requests with correct parameters to these other services, I get errors like 404 Not Found or unauthorized access. So I believe that some URL path must be appended to these, just as with Contacts. If someone knows how or where I can find the correct URL paths, I will be really thankful.
Thanks in advance to everyone for your time reading and answering my question.
I am afraid there doesn't seem to be an official source for these API endpoints, since they seem to be discovered through sniffing the network calls rather than a proper guide from Apple. For example, this presentation, which comes from a forensic tools company, is from 2013 and covers some of the relevant endpoints. Note that iOS was still at versions 5 & 6 then (vs. the current v9.3).
All other code samples on the net basically are using the same set of API endpoints that were originally observed in 2012-2013. (Here's a snippet from another python module with additional URLs you may use.) However, all of them pretty much point to each other as the source.
If you'd like to pursue a different path, Apple now promotes the CloudKit and CloudKit JS solutions for registered apps working with iCloud data.
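For illustration, the Contacts call mentioned in the question looks roughly like this. This is only a sketch: the parameter names mirror what pyicloud sends, the values are placeholders, and the session is assumed to already carry the iCloud authentication cookies:
import requests

service_root = "https://p45-contactsws.icloud.com:443"  # taken from the auth response

session = requests.Session()  # assumed to be authenticated already (cookies set)
resp = session.get(
    service_root + "/co/startup",
    params={
        "clientVersion": "2.1",  # placeholder values
        "locale": "en_US",
        "order": "last,first",
        "dsid": "<your-dsid>",
    },
)
print(resp.status_code)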

404: Is there any way to avoid being blocked by website while scraping using scrapy

I was trying to use Scrapy to scrape a website with about 70k items, but every time, after it has scraped about 200 items, this error pops up for the rest:
scrapy] DEBUG: Ignoring response <404 http://www.somewebsite.com/1234>: HTTP status code is not handled or not allowed
I believe it is because my spider got blocked by the website. I tried using the random user agent suggested here, but it doesn't solve the problem at all. Are there any good suggestions?
If you're being blocked, your spider is probably hitting the site too often or too fast.
In addition to a random user agent, you can try setting the CONCURRENT_REQUESTS and DOWNLOAD_DELAY options in settings.py. The defaults are fairly aggressive and will hammer a site.
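For example, in settings.py (a sketch; the values are illustrative starting points, not official recommendations):
CONCURRENT_REQUESTS = 2          # default is 16
DOWNLOAD_DELAY = 3               # seconds to wait between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay between 0.5x and 1.5x
AUTOTHROTTLE_ENABLED = True      # back off automatically when the site slows down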
The other options you have are using proxies, or using AWS nano instances, which get a new IP on each reboot.
Remember that scraping is at best a gray area, you absolutely need to respect the site owners. The best way is obviously to seek permission from the owner but failing that you need to make sure your scraping efforts don't stand out from the usual browsing patterns or you'll get blocked in no time.
Some sites use fairly sophisticated techniques to identify scrapers, including cookies and JavaScript as well as request patterns, time on site, etc. There are also a number of cloud-based anti-scraping solutions such as Distil or ShieldSquare; if you're up against one of those, you'll need to put in a lot of effort to make your spider look human!
Can you force someone to answer your questions or give you information? Neither can you force a web server. At best you can try to impersonate a client that the web server will answer to. To do that you need to figure out the criteria the server uses to decide whether or not to answer the request, then you can (try to) form a request that will meet the criteria.

Disable SSL for Heroku App (django)

We've decided not to use SSL anymore, and unfortunately our server guy has quit, so now I need to fix this. I've revoked the certs from Comodo and removed the SSL app from Heroku, but that was apparently not enough, and now we have serious problems with our site.
When visiting inteokej.nu, one gets redirected to the app, but http automatically turns to https, and instead of the domain (inteokej.nu) the app link https://inteokej.herokuapp.com is shown (I want inteokej.nu to be shown, not the actual app link).
That is a problem, but not the biggest one, which is that the site can no longer be used (e.g. for logging in; the static pages work, though). When I try to log in, I first get an https security error, and when I proceed I end up at the following page: https://www.inteokej.nu/cgi-sys/defaultwebpage.cgi ("Sorry! If you are the owner of this website, please contact your hosting provider: webmaster@inteokej.nu").
I've now learned the hard way that SSL is a complex thing, but I really need to get this site up again as soon as possible. So, where should I start and how should I proceed from this point? I guess some back-end changes need to be made in the Django code as well?
Thanks a lot in advance!
Your issue doesn't seem to be with SSL but with DNS, or at least with how your server guy set things up.
The error page you're seeing isn't a Heroku error: inteokej.nu isn't being hosted on Heroku but on a server run by your DNS provider, svenskadomaner.se.
If you use the Firefox Live HTTP Headers plugin you can follow the request/response cycle and you'll see that there is a 301 redirect from www.inteokej.nu to inteokej.herokuapp.com (probably an .htaccess redirect).
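If you don't have that plugin handy, the same chain is visible from Python (a quick sketch with requests):
import requests

resp = requests.get("http://www.inteokej.nu")
for hop in resp.history:  # one entry per redirect followed
    print(hop.status_code, hop.url, "->", hop.headers.get("Location"))
print(resp.status_code, resp.url)  # where you finally end up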
Check the DNS records for your domain (for example here: http://viewdns.info/dnsrecord/?domain=inteokej.nu) and you'll see that there is no CNAME record pointing to Heroku, only an A record to 46.22.116.5, which is an IP address owned by svenskadomaner.se.
So the thing to do is to set up the custom domain as recommended on Heroku's site:
https://devcenter.heroku.com/articles/custom-domains
and set the CNAME to Heroku's recommendation.
One reason your server guy might have set things up this way is that Heroku doesn't easily allow "naked domains", so people often do .htaccess redirects from example.com to www.example.com (which does work easily with CNAMEs).
Good luck!

how to crawl a 403 forbidden SNS

I'm crawling an SNS with a crawler written in Python.
It worked for a long time, but a few days ago the webpages fetched by my servers started coming back as ERROR 403 FORBIDDEN.
I tried changing the cookie, the browser, and the account, but all failed.
It also seems that the forbidden servers are in the same network segment.
What can I do? Steal someone else's IP?
Thanks a lot.
Looks like you've been blacklisted at the router level in that subnet, perhaps because you (or somebody else in the subnet) violated the terms of use, robots.txt, the maximum crawling frequency specified in a site map, or something like that.
The solution is not technical, but social: contact the webmaster, be properly apologetic, learn what exactly you (or one of your associates) had done wrong, convincingly promise to never do it again, apologize again until they remove the blacklisting. If you can give that webmaster any reason why they should want to let you crawl that site (e.g., your crawling feeds a search engine that will bring them traffic, or something like this), so much the better!-)
