Big requests issue: GET doesn't release/reset TCP connections, loop crashes - Python

I'm using Python 3.3 and the requests module to scrape links from an arbitrary webpage. My program works as follows: I have a list of URLs which in the beginning contains just the starting URL.
The program loops over that list and passes each URL to a procedure GetLinks, where I'm using requests.get and BeautifulSoup to extract all links. Before that procedure appends links to my URL list, it passes them to another procedure, testLinks, to see whether each one is an internal, external or broken link. In testLinks I'm using requests.get too, so that I can handle redirects etc.
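Roughly, the structure looks like this (a simplified sketch with placeholder details, not my exact code):

import requests
from bs4 import BeautifulSoup

def GetLinks(url):
    # fetch the page and pull out all anchor hrefs
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

def testLinks(url):
    # GET so that redirects can be followed and inspected
    response = requests.get(url, allow_redirects=True)
    return response.status_code, response.url

urllist = ["http://example.com"]  # placeholder starting URL
for url in urllist:
    for link in GetLinks(url):
        status, final_url = testLinks(link)
        # ... classify as internal / external / broken and possibly append to urllist ...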
The program has worked really well so far; I tested it with quite a few websites and was able to get all links of pages with around 2000 subpages. But yesterday I encountered a problem on one page, which I noticed by looking at the Kaspersky Network Monitor. On this page some TCP connections just don't reset; it seems to me that in that case the initial request for my first URL doesn't get reset, and the connection stays open for as long as my program runs.
OK so far. My first try was to use requests.head instead of .get in my testLinks procedure, and then everything works fine! The connections are released as wanted. But the problem is that the information I get from requests.head is not sufficient; I'm not able to see the redirected URL or how many redirects took place.
Then I tried requests.head with
allow_redirects=True
But unfortunately this is not a real .head request; it behaves like a usual .get request, so I got the same problem. I also tried to set the parameter
keep_alive=False
but it didn't work either. I even tried to use urllib.request.urlopen(url).geturl() in testLinks for the redirect issues, but the same problem occurs there: the TCP connections don't get reset.
I have tried a lot to avoid this problem. I used requests sessions, but they showed the same problem. I also tried requests.post with the header Connection: close, but that didn't work either.
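For completeness, those two attempts looked roughly like this (simplified sketch, not the exact code):

import requests

url = "http://example.com"  # placeholder test link

# attempt with a session instead of bare requests.get calls
session = requests.Session()
r = session.get(url)

# attempt with an explicit Connection: close header, as described above
r = requests.post(url, headers={"Connection": "close"})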
I analyzed some links where I think it gets stuck, and so far I believe it has something to do with redirect chains like 301->302. But I'm really not sure, because on all the other websites I tested there must have been such redirects too; they are quite common.
I hope someone can help me. For information: I'm using a VPN connection to be able to see all websites, because the country I'm in right now blocks some pages that are interesting to me. But of course I tested it without the VPN and had the same problem.
Maybe there's a workaround, because requests.head in testLinks would be sufficient if, in the case of redirects, I could just see the final URL and maybe the number of redirects.
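To make that concrete, the kind of testLinks I am after would look something like this (sketch; r.url and r.history would give me the final URL and the redirect count, but with allow_redirects=True this showed the same lingering connections for me as .get):

import requests

def testLinks(url):
    r = requests.head(url, allow_redirects=True, timeout=10)
    final_url = r.url                # URL after all redirects
    redirect_count = len(r.history)  # one response object per redirect hop
    r.close()                        # try to release the connection explicitly
    return final_url, redirect_count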
If the text is hard to read, I will provide a scheme of my code.
Thanks a lot!

Related

How to be undetectable with chrome webdriver?

I've already seen multiple posts on Stackoverflow regarding this. However, some of the answers are outdated (such as using PhantomJS) and others didn't work for me.
I'm using selenium to scrape a few sports websites for their data. However, every time I try to scrape these sites, a few of them block me because they know I'm using chromedriver. I'm not sending very many requests at all, and I'm also using a VPN. I know the issue is with chromedriver because even when I stop running my code and just open these sites in chromedriver, I'm still blocked. However, when I open them in my default web browser, I can access them perfectly fine.
So, I wanted to know if anyone has any suggestions of how to avoid getting blocked from these sites when scraping them in selenium. I've already tried changing the '$cdc...' variable within the chromedriver, but that didn't work. I would greatly appreciate any ideas, thanks!
Obviously they can tell you're not using a common browser. Could it have something to do with the User Agent?
Try it out with something like Postman. See what the responses are. Try messing with the user agent and other request fields. Look at the request headers when you access the site with a regular browser (like Chrome) and try to spoof those.
Edit: just remembered this and realized the page might be performing some checks in JS and whatnot. It's worth looking into what happens when you block JS on the site with a regular browser.
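If you want to experiment with the user agent from Selenium itself, something like this is a starting point (the user-agent string and URL are placeholders, and hiding the automation switch is just one tweak that sites may or may not check for):

from selenium import webdriver

options = webdriver.ChromeOptions()
# spoof a regular desktop user agent (copy the exact value your normal browser sends)
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
# drop the "controlled by automated test software" switch
options.add_experimental_option("excludeSwitches", ["enable-automation"])

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder for one of the sports sites
print(driver.execute_script("return navigator.userAgent"))
driver.quit()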

Python urllib2: Getting blocked from a website?

I had been using urllib2 to parse data from HTML webpages. It was working perfectly for some time, and then it stopped working permanently for one website.
Not only did the script stop working, but I was no longer able to access the website at all, from any browser. In fact, the only way I could reach the website was from a proxy, leading me to believe that requests from my computer were blocked.
Is this possible? Has this happened to anyone else? If that is the case, is there any way to get unblocked?
It is indeed possible; maybe the sysadmin noticed that your IP was making way too many requests and decided to block it.
It could also be that the server has a request limit that you exceeded.
If you don't have a static IP, a restart of your router should reset your IP, making the ban useless.
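Going forward, it also helps to throttle the script so it stays under whatever limit the server enforces; a minimal urllib2 sketch (the delay, URLs and User-Agent value are assumptions) could look like this:

import time
import urllib2

urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholder URLs
delay_seconds = 5  # assumed polite delay between requests

for url in urls:
    request = urllib2.Request(url, headers={
        # browser-like header; copy the value your real browser sends
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    })
    response = urllib2.urlopen(request)
    html = response.read()
    response.close()
    print("%s -> %d bytes" % (url, len(html)))
    time.sleep(delay_seconds)  # throttle so the server is not hammered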

Python requests to online store not checking out

I am new here, so bear with me if I break the etiquette for this forum. Anyway, I've been working on a Python project for a while now and I'm nearing the end, but I've been dealing with the same problem for a couple of days and I can't figure out what the issue is.
I'm using Python and the requests module to send a POST request to the checkout page of an online store. The response I get when I send it is the page where you put in your information, not the page that says your order was confirmed, etc.
At first I thought that it could be the form data I was sending, and I was right: I checked what it was supposed to be in the Network tab in Chrome and saw I was sending 'Visa' when it was supposed to be 'visa', but it still didn't work after fixing that. Then I thought it could be the encoding, but I have no clue how to check what kind the site expects.
Do any of you have any ideas of what could be preventing this from working? Thanks.
EDIT: I realized that I wasn't sending a Cookie in the request headers, so I fixed that, and it's still not working. I set up a server script on another computer that prints the incoming request and posted to that instead, and the requests are exactly the same, both headers and body. I have no clue what it could possibly be.
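Here is a stripped-down sketch of what the request looks like (placeholder URL and field names; the real form has more fields):

import requests

checkout_url = "https://shop.example.com/checkout"  # placeholder

form_data = {
    "card_type": "visa",  # hypothetical field name; value corrected from 'Visa'
    # ... the rest of the form fields copied from the Network tab ...
}

headers = {
    "Cookie": "session=...",  # cookie copied from the browser, added after the edit above
}

response = requests.post(checkout_url, data=form_data, headers=headers)
print(response.text[:200])  # still renders the information page, not the confirmation page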

capture http calls and headers in python

We are testing videos on our website, and in order to play a video it should authenticate the user, get authorization for the device they are playing on, check their entitlements, etc.
We have many varieties of networks and video types to test, and I am in the process of writing a script which checks that one of those calls works fine for all types of videos.
It is a POST call, and I need to build the param/data to post. There's no direct way of getting one of the param values, and this is how we do it currently: we go to the browser and play the video, open dev tools like Firebug, capture the param value from the request header of that same call, and use it in my script for the other 100-odd calls that I verify programmatically.
Is there a way in Python to do the steps we are doing manually? Like opening a URL and capturing all the calls that happen in the background, just like Firebug does?
I have been trying FirePython, FireLogger and mechanize to see if they help, but I have invested so much time figuring this out that I thought it was time to ask for some expert advice.
If you haven't looked at the Requests library, it's generally quite pleasant to work with and might make your life easier.
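For instance, once you rebuild the call with Requests, you can print both the headers it actually sent and the ones it got back, which covers part of what you currently read off Firebug (the URL and payload below are placeholders for your playback call):

import requests

url = "https://api.example.com/playback/authorize"  # placeholder endpoint
payload = {"video_id": "12345", "device": "web"}    # placeholder params

response = requests.post(url, data=payload)

# headers requests actually sent (the prepared request is kept on the response)
print("Request headers:")
for name, value in response.request.headers.items():
    print("  %s: %s" % (name, value))

# headers the server returned
print("Response headers:")
for name, value in response.headers.items():
    print("  %s: %s" % (name, value))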

how to crawl a 403 forbidden SNS

I'm crawling an SNS with a crawler written in Python.
It worked for a long time, but a few days ago the webpages fetched by my servers started coming back as ERROR 403 FORBIDDEN.
I tried changing the cookie, changing the browser and changing the account, but all of that failed.
And it seems that the forbidden servers are all in the same network segment.
What can I do? Steal someone else's IP? = =...
Thanks a lot.
Looks like you've been blacklisted at the router level in that subnet, perhaps because you (or somebody else in the subnet) was violating terms of use, robots.txt, max crawling frequency as specified in a site-map, or something like that.
The solution is not technical, but social: contact the webmaster, be properly apologetic, learn what exactly you (or one of your associates) had done wrong, convincingly promise to never do it again, apologize again until they remove the blacklisting. If you can give that webmaster any reason why they should want to let you crawl that site (e.g., your crawling feeds a search engine that will bring them traffic, or something like this), so much the better!-)
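Once you are unblocked, it may also help to have the crawler check robots.txt before each fetch so you stay inside whatever rules they publish; Python's standard library has a parser for that (the base URL and user-agent below are placeholders):

from urllib import robotparser

base_url = "https://sns.example.com"  # placeholder for the site you crawl
user_agent = "my-crawler"             # placeholder crawler name

rp = robotparser.RobotFileParser()
rp.set_url(base_url + "/robots.txt")
rp.read()

page = base_url + "/some/profile/page"
if rp.can_fetch(user_agent, page):
    print("allowed to fetch", page)
else:
    print("robots.txt disallows", page)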
