I'm crawling an SNS with a crawler written in Python.
It worked for a long time, but a few days ago the pages fetched by my servers started coming back as ERROR 403 FORBIDDEN.
I tried changing the cookie, changing the browser, and changing the account, but all of that failed.
It also seems that the forbidden servers are all in the same network segment.
What can I do? Steal someone else's IP? = =...
Thanks a lot.
Looks like you've been blacklisted at the router level in that subnet, perhaps because you (or somebody else in the subnet) violated the terms of use, robots.txt, the maximum crawling frequency specified in a sitemap, or something like that.
The solution is not technical, but social: contact the webmaster, be properly apologetic, learn what exactly you (or one of your associates) had done wrong, convincingly promise to never do it again, apologize again until they remove the blacklisting. If you can give that webmaster any reason why they should want to let you crawl that site (e.g., your crawling feeds a search engine that will bring them traffic, or something like this), so much the better!-)
Related
I was testing a scraping algorithm that I had built. I made a request to https://www2.hm.com/fi_fi/miesten.html but misspecified the user-agent information. It seems that this triggered an immediate ban (not sure). Scraping their site should be fine; their robots.txt says:
User-agent: *
Disallow:
Here is an example of making a request to HM and the subsequent server response.
I erased the user agent and proxy information due to privacy concerns; however, they are nothing out of the ordinary.
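For illustration, the request has roughly the following shape (the user-agent string and proxy address below are placeholders, not the values I actually used):

import requests

headers = {
    "User-Agent": "Mozilla/5.0 ...",  # placeholder, real value erased
    "Accept-Language": "fi-FI,fi;q=0.9",
}
proxies = {
    "http": "http://user:pass@proxy.example.com:8080",   # placeholder
    "https": "http://user:pass@proxy.example.com:8080",  # placeholder
}

r = requests.get(
    "https://www2.hm.com/fi_fi/miesten.html",
    headers=headers,
    proxies=proxies,
    timeout=30,
)
print(r.status_code)
print(r.content)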
I receive the following as response:
"b'\nAccess Denied\n\n\n \nYou don't have permission to access "http://www2.hm.com/fi_fi/miesten.html" on this server.\nReference #18.2796ef50.1625728417.f9aab80\n\n\n'"
So my question is: is there anything I can do to lift this ban? Can I contact someone on their end and ask them to lift it? If so, where can that kind of contact information usually be found?
Although this question concerns this site in particular, it is really a much broader question: in the case of a ban, can the user try to contact someone who runs the server? I thought about contacting customer support, but I strongly suspect they cannot help with this issue and won't even understand what it is about.
I have googled this issue but found nothing helpful. The usual advice is to clear the cache, memory, etc., which is not the problem here. I can access the site via Chrome or other browsers, but when using requests from Python, this problem appears.
Pretty sure you need to use a scraping bot that can execute JavaScript; you can try this tool: https://docs.python-requests.org/projects/requests-html/en/latest/
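A minimal sketch with requests-html (the User-Agent string is just an example; the first call to render() downloads a headless Chromium):

from requests_html import HTMLSession

session = HTMLSession()
resp = session.get(
    "https://www2.hm.com/fi_fi/miesten.html",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},  # example value
)
resp.html.render()  # executes the page's JavaScript
print(resp.status_code)
print(resp.html.find("title", first=True).text)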
And to get contact information about the owner of a website you can use the Unix whois command:
whois hm.com
I am new here, so bear with me if I break the etiquette of this forum. Anyway, I've been working on a Python project for a while now and I'm nearing the end, but I've been dealing with the same problem for a couple of days and I can't figure out what the issue is.
I'm using Python and the requests module to send a POST request to the checkout page of an online store. The response I get when I send it is the page where you enter your information, not the page that says your order was confirmed.
At first I thought it could be the form data I was sending, and I was partly right: I checked what it was supposed to be in the Network tab in Chrome and saw I was sending 'Visa' when it was supposed to be 'visa', but it still didn't work after that. Then I thought it could be the encoding, but I have no clue how to check what kind the site expects.
Do any of you have any idea what could be preventing this from working? Thanks.
EDIT: I realized that I wasn't sending a Cookie in the request headers, so I fixed that, and it's still not working. I set up a server script on another computer that prints incoming requests and posted to that instead, and the requests are exactly the same, both headers and body. I have no clue what it could possibly be.
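For context, the rough shape of my code is below (the URLs and form field names are placeholders, not the store's real ones):

import requests

# Use a session so cookies picked up from earlier pages are sent automatically.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 ..."})  # placeholder

# Load the checkout page first so the session collects its cookies.
session.get("https://shop.example.com/checkout")  # placeholder URL

# Placeholder form fields; the real names come from the Network tab in Chrome.
form = {
    "card_type": "visa",
    "card_number": "4111111111111111",
    "email": "me@example.com",
}
resp = session.post("https://shop.example.com/checkout/submit", data=form)
print(resp.status_code, resp.url)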
My question may look silly, but I am asking it after a lot of searching on Google without finding any clue.
I am using the iCloud web services. For that, I have ported this Python code to PHP: https://github.com/picklepete/pyicloud
Up to this point, everything works fine. When I authenticate using an iCloud username and password, I get a list of web service URLs as part of the response. Now, for example, to use the Contacts web service, I need to take the Contacts web service URL and add a path to it to fetch contacts:
https://p45-contactsws.icloud.com:443/co/startup with some parameters.
The web service URL https://p45-contactsws.icloud.com:443 comes back in the authentication response, but the later part, 'co/startup', is hard-coded in the Python code, and I don't know how they found it. The services covered in the Python code work fine, but I want to use a few other services such as https://p45-settingsws.icloud.com:443 and https://p45-keyvalueservice.icloud.com:443, and when I send requests with correct parameters to these other services, I get errors like 404 Not Found or unauthorized access. So I believe some URL part must be appended to these, just like for contacts. If someone knows how or where I can find the correct URL parts, I would be really thankful.
Thanks in advance to everyone for taking the time to read/answer my question.
I am afraid there doesn't seem to be an official source for these API endpoints, since they appear to have been discovered by sniffing the network calls rather than from a proper guide from Apple. For example, this presentation, which comes from a forensic tools company, is from 2013 and covers some of the relevant endpoints. Note that iOS was still at versions 5 and 6 then (vs. the current v9.3).
All the other code samples on the net basically use the same set of API endpoints that were originally observed in 2012-2013. (Here's a snippet from another Python module with additional URLs you may use.) However, all of them pretty much point to each other as the source.
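To illustrate the pattern those clients follow, here is a rough sketch of a Contacts call in the pyicloud style; the query parameters are assumptions taken from such clients, not documented by Apple, and the session must already hold the cookies from a successful iCloud login:

import requests

# Service root taken from the "webservices" section of the login response.
service_root = "https://p45-contactsws.icloud.com:443"

session = requests.Session()
# ... authenticate against iCloud first so this session carries the auth cookies ...

params = {
    "clientVersion": "2.1",   # assumed value, copied from pyicloud-style clients
    "locale": "en_US",
    "order": "last,first",
    "dsid": "<dsid from the login response>",
}
resp = session.get(service_root + "/co/startup", params=params)
print(resp.status_code)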
If you'd like to pursue a different path, Apple now promotes the CloudKit and CloudKit JS solutions for registered apps working with iCloud data.
I was trying to use Scrapy to scrape a website with about 70k items, but every time, after it has scraped about 200 items, this error pops up for the rest:
[scrapy] DEBUG: Ignoring response <404 http://www.somewebsite.com/1234>: HTTP status code is not handled or not allowed
I believe it is because my spider got blocked by the website, and I tried using a random user agent as suggested here, but it doesn't solve the problem at all. Are there any good suggestions?
If you're being blocked, your spider is probably hitting the site too often or too fast.
In addition to a random user agent, you can try setting the CONCURRENT_REQUESTS and DOWNLOAD_DELAY options in settings.py. The defaults are fairly aggressive and will hammer a site.
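For example, something like this in settings.py (the numbers are just a conservative starting point, not magic values):

# settings.py -- polite crawling settings (example values)
CONCURRENT_REQUESTS = 2           # default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 5                # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True   # jitter the delay so it looks less robotic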
The other options you have are using proxies, or using AWS nano instances, which get a new IP on each reboot.
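A per-request proxy in Scrapy looks roughly like this (the proxy address and URL are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # Scrapy's built-in HttpProxyMiddleware picks up meta["proxy"].
        yield scrapy.Request(
            "http://www.somewebsite.com/1234",                # placeholder URL
            meta={"proxy": "http://user:pass@1.2.3.4:8080"},  # placeholder proxy
        )

    def parse(self, response):
        pass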
Remember that scraping is at best a gray area, and you absolutely need to respect the site owners. The best way is obviously to seek permission from the owner, but failing that, you need to make sure your scraping efforts don't stand out from normal browsing patterns or you'll get blocked in no time.
Some sites use fairly sophisticated techniques to identify scrapers, including cookies and JavaScript as well as request patterns, time on site, etc. There are also a number of cloud-based anti-scraping solutions such as Distil or ShieldSquare; if you're up against one of those, you'll need to put in a lot of effort to make your spider look human!
Can you force someone to answer your questions or give you information? Neither can you force a web server. At best you can try to impersonate a client that the web server will answer to. To do that you need to figure out the criteria the server uses to decide whether or not to answer the request, then you can (try to) form a request that will meet the criteria.
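In practice that usually means copying what a real browser sends. A rough sketch with requests follows; the header values are examples copied from a browser's Network tab, not values the server is known to require:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/96.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Referer": "https://www.example.com/",  # placeholder
}

with requests.Session() as session:
    resp = session.get("https://www.example.com/page", headers=headers)  # placeholder URL
    print(resp.status_code)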
I'm using Python 3.3 and the requests module to scrape links from arbitrary webpages. My program works as follows: I have a list of URLs which, at the beginning, contains just the starting URL.
The program loops over that list and passes the URLs to a procedure GetLinks, where I use requests.get and BeautifulSoup to extract all links. Before that procedure appends links to my URL list, it passes them to another procedure, testLinks, to see whether each one is an internal, external, or broken link. In testLinks I use requests.get too, so that I can handle redirects etc.
The program has worked really well so far; I tested it on quite a few websites and was able to get all the links of sites with around 2000 pages. But yesterday I ran into a problem on one page, which I noticed by watching the Kaspersky Network Monitor: some TCP connections just don't get reset. It looks like the connection for the initial request for my first URL never gets reset; it stays open for as long as my program runs.
OK so far. My first attempt was to use requests.head instead of .get in my testLinks procedure, and then everything works fine! The connections are released as expected. But the problem is that the information I get from requests.head is not sufficient: I cannot see the redirected URL or how many redirects took place.
Then I tried requests.head with
allow_redirects=True
but unfortunately this is not a real .head request; it behaves like a usual .get request, so I got the same problem. I also tried setting the parameter
keep_alive=False
but it didn't work either. I even tried using urllib.request.urlopen(url).geturl() in testLinks for the redirect issues, but the same problem occurs there: the TCP connections don't get reset.
I have tried a lot to avoid this problem. I used requests sessions, but they had the same problem. I also tried a requests.post with the header Connection: close, but that didn't work either.
I analyzed some of the links where I think it gets stuck, and so far I believe it has something to do with redirect chains like 301 -> 302. But I'm really not sure, because on all the other websites I tested there must have been such redirects too; they are quite common.
I hope someone can help me. For information: I'm using a VPN connection to be able to see all websites, because the country I'm in right now blocks some pages that are interesting to me. But of course I also tested without the VPN and had the same problem.
Maybe there's a workaround, because requests.head in testLinks would be sufficient if, in case of redirects, I could just see the final URL and maybe the number of redirects.
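Something along these lines is what I mean: following redirects by hand with HEAD requests only, so the final URL and the number of hops are visible (a sketch; I haven't verified that it avoids the hanging connections):

from urllib.parse import urljoin
import requests

def resolve_with_head(url, max_redirects=10):
    # Follow redirects manually using HEAD requests only.
    hops = []
    with requests.Session() as session:
        for _ in range(max_redirects):
            resp = session.head(url, allow_redirects=False, timeout=10)
            resp.close()  # release the connection explicitly
            if resp.is_redirect or resp.is_permanent_redirect:
                hops.append(resp.status_code)
                url = urljoin(url, resp.headers["Location"])
            else:
                return url, hops, resp.status_code
    return url, hops, None  # gave up after max_redirects

final_url, hops, status = resolve_with_head("http://example.com/some/link")  # placeholder URL
print(final_url, len(hops), status)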
If the text is not easy to read, I can provide a scheme of my code.
Thanks a lot!