I'm trying to make urllib requests to http://google.com in Python 3 (I rewrote it in 2.7 using urllib2 as well, same issue). Below is some of my code:
import urllib.request
from urllib.request import urlopen
import http.cookiejar
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.91 Safari/537.36')]
def makeRequest():
    search = 'http://google.com'
    print('About to search...')
    response = opener.open(search).read()
    print('Done')
makeRequest()
When I run this code, it runs in about 14 seconds:
real 0m14.386s
user 0m0.087s
sys 0m0.027s
This seems to be the case with any Google site (Gmail, Google Play, etc.). When I change the search variable to a different site, such as Stack Overflow or Twitter, it runs in well under half a second:
real 0m0.277s
user 0m0.085s
sys 0m0.017s
Does anyone know what could be causing the slow response from Google?
First, you can use ping or traceroute against google.com and other sites and compare the delays to see whether DNS resolution is the issue.
Second, you can use Wireshark to sniff the packets and see whether something is wrong with the communication itself.
I suspect it is a DNS issue, but I can't be sure.
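If you want to check the DNS theory from Python itself, here is a rough sketch (standard library only; the hosts are just examples) that times the name lookup separately from the full request:
import socket
import time
import urllib.request

for host in ('google.com', 'stackoverflow.com'):
    # Time the DNS lookup on its own.
    t0 = time.time()
    socket.getaddrinfo(host, 80)
    dns = time.time() - t0

    # Time the full HTTP request (download included).
    t0 = time.time()
    try:
        urllib.request.urlopen('http://' + host, timeout=30).read()
    except Exception:
        pass  # even an HTTP error means the host answered; we only care about timing
    total = time.time() - t0

    print('%s: DNS %.2fs, full request %.2fs' % (host, dns, total))
If the DNS number is large only for Google hosts, the problem is in name resolution rather than in urllib.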
I have been trying to login to a website using python 3.6 but it has proven to be more difficult than i originally anticipated. So far this is my code:
import urllib.request
import urllib.parse
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"
url = "https://www.pinterest.co.uk/login/"
data = {
"email" : "my#email",
"password" : "my_password"}
data = urllib.parse.urlencode(data)
data = data.encode("utf-8")
request = urllib.request.Request(url, headers = headers, data = data)
response = urllib.request.urlopen(request)
responseurl = response.geturl()
print(responseurl)
This throws a 403 (Forbidden) error, and I'm not sure why, as I have added my email and password and even changed the user agent. Am I just missing something simple like a cookie jar?
If possible, is there a way to do this without using the requests module? This is a challenge I have been given to complete with only built-in modules (but I am allowed to get help, so I'm not cheating).
Most sites use a CSRF token or other means to block exactly what you are attempting to do. One possible workaround is to use a browser automation framework such as Selenium and log in through the site's UI.
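If you do want to stay with only built-in modules, a rough sketch of the usual pattern is below: GET the login page first with a cookie jar, harvest every hidden form field (that is where the CSRF token usually lives), and POST them back together with the credentials. The exact field names Pinterest expects are not shown here, and the site may still reject scripted logins, so treat this as a starting point rather than a working login:
import urllib.request
import urllib.parse
import http.cookiejar
from html.parser import HTMLParser

class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of all hidden <input> fields on a page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden':
            self.fields[attrs.get('name')] = attrs.get('value', '')

url = "https://www.pinterest.co.uk/login/"
headers = {"User-Agent": "Mozilla/5.0"}

# Use a cookie jar so the session cookie set on the GET is sent back on the POST.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# 1. GET the login page and harvest the hidden fields (CSRF token etc.).
page = opener.open(urllib.request.Request(url, headers=headers)).read().decode("utf-8")
parser = HiddenInputParser()
parser.feed(page)

# 2. POST the credentials together with whatever hidden fields the form had.
data = {k: (v or '') for k, v in parser.fields.items() if k}
data.update({"email": "my#email", "password": "my_password"})
body = urllib.parse.urlencode(data).encode("utf-8")
response = opener.open(urllib.request.Request(url, headers=headers, data=body))
print(response.geturl())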
This is my first time posting on Stack, so I'll do my best to get right to the point. I've just begun delving into the formulation of HTTP requests for web scraping purposes, and I decided to choose one site to practice logging in to using the requests library in Python. I already took the liberty of extracting the csrfKey from the HTML on the first GET, but after the POST I still end up on the login page with the fields filled out and do not successfully log in. Any help would be much appreciated, as I'm completely stumped on what I should try next. Thank you all!
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36'}
payload = {
    'auth':'username',
    'password':'pass',
    'login_standard_submitted:':'1',
    'remember_me':'0',
    'remember_me_checkbox':'1',
    'signin_anonymous':'0'
}
s = requests.Session()
r = s.get("http://www.gaming-asylum.com/forums/index.php?/",headers=headers)
soup = BeautifulSoup(r.content, "html.parser")  # specify a parser to avoid the bs4 warning
payload['csrfKey'] = str(soup.find('input',{'name':'csrfKey'}).get('value'))
headers['Content-Type'] = 'application/x-www-form-urlencoded'
headers['Referer'] = 'http://www.gaming-asylum.com/forums/?_fromLogout=1&_fromLogin=1'
headers['Upgrade-Insecure-Requests']='1'
headers['Origin']='http://www.gaming-asylum.com'
r = s.post("http://www.gaming-asylum.com/forums/index.php?/login/", headers=headers, data=payload)
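Not an answer as such, but one debugging step that often helps with IPS-style forums: collect every hidden input from the login form (not just csrfKey), post those back along with the credentials, and print where the POST actually lands. A sketch along those lines, reusing the URLs and field names from the code above (they may still need adjusting):
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
s = requests.Session()

# Fetch the page that contains the login form.
r = s.get("http://www.gaming-asylum.com/forums/index.php?/", headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

# Collect *all* hidden inputs, not just csrfKey -- forums often include extra
# hidden fields that must be posted back as well.
payload = {inp.get('name'): inp.get('value', '')
           for inp in soup.find_all('input', type='hidden') if inp.get('name')}
payload.update({'auth': 'username', 'password': 'pass', 'remember_me': '0'})

r = s.post("http://www.gaming-asylum.com/forums/index.php?/login/",
           headers=headers, data=payload)

# If we are still on the login page, the URL and body usually contain an error hint.
print(r.url)
print(r.status_code)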
I am using the Python requests get method to query the MediaWiki API, but it takes a long time to receive the response. The same requests get a response very quickly through a web browser. I have the same issue requesting google.com. Here is the sample code I am trying in Python 3.5 on Windows 10:
response = requests.get("https://www.google.com")
response = requests.get("https://en.wikipedia.org/wiki/Main_Page")
response = requests.get("http://en.wikipedia.org/w/api.php?", params={'action':'query', 'format':'json', 'titles':'Labor_mobility'})
However, I don't face this issue retrieving other websites like:
response = requests.get("http://www.stackoverflow.com")
response = requests.get("https://www.python.org/")
This sounds like there is an issue with the underlying connection to the server, because requests to other URLs work. These come to mind:
The server might only allow specific user-agent strings
Try adding innocuous headers, e.g.: requests.get("https://www.example.com", headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"})
The server rate-limits you
Wait for a few minutes, then try again. If this solves your issue, you could slow down your code by adding time.sleep() to prevent being rate-limited again.
IPv6 does not work, but IPv4 does
Verify by executing curl --ipv6 -v https://www.example.com. Then, compare to curl --ipv4 -v https://www.example.com. If the latter is significantly faster, you might have a problem with your IPv6 connection. Check here for possible solutions.
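For the IPv6 case specifically, you can also force requests to resolve IPv4 only from inside Python. This patches an internal urllib3 helper (on older requests versions the vendored copy lives at requests.packages.urllib3 instead), so treat it as a diagnostic hack rather than a supported API:
import socket
import requests
import urllib3.util.connection as urllib3_cn

def allowed_gai_family():
    # Tell urllib3 (which requests uses under the hood) to ask the
    # resolver for IPv4 addresses only.
    return socket.AF_INET

urllib3_cn.allowed_gai_family = allowed_gai_family

response = requests.get("https://www.google.com", timeout=10)
print(response.status_code, response.elapsed)
If the request is suddenly fast with this patch in place, the slowdown is an IPv6 connectivity problem, not anything in your code.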
Didn't solve your issue?
If that did not solve your issue, I have collected some other possible solutions here.
I'm trying to add functionality to a headless web browser. I know there are easier ways, but I stumbled across seleniumrequests and it sparked my interest. I was wondering whether there is a way to add request headers as well as POST data as a payload. I've done some searching around and haven't had much luck. The following prints the HTML of the first website and takes screenshots for verification, and then my program just hangs on the POST request. It doesn't terminate or raise an exception or anything. Where am I going wrong?
Thanks!
#!/usr/bin/env python
from seleniumrequests import PhantomJS
from selenium import webdriver
#Setting user-agent
webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.customHeaders.User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36'
browser = PhantomJS()
browser.get('http://www.google.com')
print browser.page_source
browser.save_screenshot('preSearch.png')
searchReq='https://www.google.com/complete/search?'
data={"q":"this is my search term"}
resp = browser.request('POST', str(searchReq), data=data)
print resp
browser.save_screenshot('postSearch.png')
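One way to narrow down whether the hang is in PhantomJS or in the endpoint itself is to fire the same POST outside the browser with an explicit timeout, so a stalled connection fails loudly instead of blocking forever. A quick sketch, assuming the requests library is available and reusing the user agent set on the driver above:
import requests

headers = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 '
                   'Chrome/37.0.2062.120 Safari/537.36'),
}
data = {'q': 'this is my search term'}

# With a timeout, a stalled connection raises requests.exceptions.Timeout
# instead of hanging silently the way the browser.request() call does.
resp = requests.post('https://www.google.com/complete/search?',
                     headers=headers, data=data, timeout=10)
print(resp.status_code)
print(resp.text[:200])
If this completes quickly, the problem is in the PhantomJS/seleniumrequests layer rather than the endpoint.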
I am building a broken-link checker in Python 3.4 to help ensure the quality of a large collection of articles that I manage. Initially I was using GET requests to check whether a link was viable; however, I am trying to be as nice as possible when pinging the URLs I'm checking, so I make sure not to re-check a URL that has already tested as working, and I have attempted to use HEAD requests instead.
However, I have found a site that causes this to simply stop. It neither throws an error nor opens:
https://www.icann.org/resources/pages/policy-2012-03-07-en
The link itself is fully functional. So ideally I'd like to find a way to process similar links. This code in Python 3.4 will reproduce the issue:
import urllib.request
from http.cookiejar import CookieJar

URL = 'https://www.icann.org/resources/pages/policy-2012-03-07-en'
req = urllib.request.Request(URL, None, {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'}, method='HEAD')
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
response = opener.open(req)
As it does not throw an error, I really do not know how to troubleshoot this further beyond narrowing it down to the link that halted the entire checker. How can I check if this link is valid?
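One way to keep the checker moving, whatever the reason this particular server stalls on HEAD, is to put a timeout on the open call and fall back to GET when the timeout fires. A sketch along those lines (the timeout value and the trimmed header set are arbitrary choices, not something the server requires):
import socket
import urllib.error
import urllib.request
from http.cookiejar import CookieJar

headers = {'User-Agent': 'Mozilla/5.0'}   # trimmed header set for brevity
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(CookieJar()))

def check_link(url, timeout=10):
    """Return an HTTP status code, or None if the URL never answered."""
    for method in ('HEAD', 'GET'):               # stay polite: HEAD first
        req = urllib.request.Request(url, None, headers, method=method)
        try:
            with opener.open(req, timeout=timeout) as resp:
                return resp.getcode()
        except urllib.error.HTTPError as e:
            return e.code                        # 4xx/5xx is still an answer
        except (socket.timeout, urllib.error.URLError):
            continue                             # stalled or refused; try GET
    return None

print(check_link('https://www.icann.org/resources/pages/policy-2012-03-07-en'))
This way a single unresponsive server costs at most two timeouts instead of halting the entire checker.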
from bs4 import BeautifulSoup, SoupStrainer
import urllib2
import requests
import re
import certifi
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

def getStatus(url):
    a = requests.get(url, verify=False)
    report = str(a.status_code)
    return report

alllinks = []
passlinks = []
faillinks = []

html_page = urllib2.urlopen("https://link")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a', attrs={'href': re.compile("^http*")}):
    #print link.get('href')
    status = getStatus(link.get('href'))
    #print ('URL---->',link.get('href'),'Status---->',status)
    link = 'URL---->', link.get('href'), 'Status---->', status
    alllinks.append(link)
    if status == '200':
        passlinks.append(link)
    else:
        faillinks.append(link)

print alllinks
print passlinks
print faillinks