I am trying to scrape data from a website. I have given the username and password, but it still throws the error below:
"URLError: urlopen error [Errno 11004] getaddrinfo failed"
Here is my code:
import urllib.request as req
proxy = req.ProxyHandler({'http':r'http://abca:Password__#123#:192.168.115.116:8080'})
auth = req.HTTPBasicAuthHandler()
opener = req.build_opener(proxy, auth, req.HTTPHandler)
req.install_opener(opener)
conn = req.urlopen('http://www.google.com')
return_str = conn.read()
Thanks & Regards,
Sanjay
URL 101:
Characters should be percent-encoded when used inside a URL component, especially the colon and the #-sign, as they are part of the URL syntax.
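A minimal sketch of what that means for the code in the question; note that the separator in front of the proxy host must also be @ rather than a colon (credentials and address are taken from the question):

import urllib.parse
import urllib.request as req

# percent-encode the password so ':' and '#' are not read as URL syntax
password = urllib.parse.quote('Password__#123#', safe='')
proxy = req.ProxyHandler({'http': f'http://abca:{password}@192.168.115.116:8080'})
auth = req.HTTPBasicAuthHandler()
opener = req.build_opener(proxy, auth, req.HTTPHandler)
req.install_opener(opener)

conn = req.urlopen('http://www.google.com')
return_str = conn.read()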
I work on a corporate network and need to write a Telegram bot. I have tried various libraries (telebot, telepot, telegram, aiogram), and everywhere I run into the problem of access through the corporate proxy. For now I have settled on telebot; I passed it the proxy, and now I get a certificate-verification error. I tried http, https, socks5, socks5h and socks4, with no success.
import telebot
import json
from telebot import apihelper

with open('params.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
username = data['username']
password = data['password']
access_token_tg = data['access_token_tg']

bot = telebot.TeleBot(access_token_tg)
apihelper.proxy = {'https': f'https://{username}:{password}@10.0.18.139:3131'}
print(bot.get_me())
I get an error in response.
requests.exceptions.SSLError: HTTPSConnectionPool(host='api.telegram.org', port=443): Max retries exceeded with url: /my token here/getMe (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123)')))
I tried talking to Telegram directly via requests and was able to connect, but only after disabling the SSL check with session.verify = False. Writing everything through raw requests is very painful, though; after all, that is exactly what all these libraries exist for.
import pprint
from requests.auth import HTTPProxyAuth
import urllib3
import requests
import json

# read config
with open('params.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
username = data['username']
password = data['password']
access_token_tg = data['access_token_tg']

# create a session to work with the bot
auth = HTTPProxyAuth(username, password)
proxies = {'https': f'https://{username}:{password}@10.0.18.139:3131'}
session = requests.Session()
session.proxies = proxies
session.auth = auth

# turn off certificate verification and the corresponding warning
urllib3.disable_warnings()
session.verify = False

MAIN_URL = f'https://api.telegram.org/bot{access_token_tg}'
res = session.post(f'{MAIN_URL}/getMe').json()
pprint.pprint(res)
This is how it works.
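As an aside, a minimal sketch of the same session with verification kept on, assuming the corporate proxy's root CA can be exported to a PEM file (the file path, credentials and token below are placeholders, not from the original setup):

import requests

username = 'user'            # placeholder
password = 'password'        # placeholder
access_token_tg = '<token>'  # placeholder

session = requests.Session()
session.proxies = {'https': f'https://{username}:{password}@10.0.18.139:3131'}
# instead of session.verify = False, trust the corporate root CA explicitly
session.verify = '/path/to/corporate-root-ca.pem'  # hypothetical path to the exported CA
res = session.post(f'https://api.telegram.org/bot{access_token_tg}/getMe')
print(res.text)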
I ask you to help me figure this out, or perhaps advise another manual or library; I have been fighting with this for 2 weeks.
I am connected to the web via VPN and I would like to connect to a news site to grab, well, news. There is a library for this: FinNews. And this is the code:
import FinNews as fn
cnbc_feed = fn.CNBC(topics=['finance', 'earnings'])
print(cnbc_feed.get_news())
print(cnbc_feed.possible_topics())
Now, because of the VPN, the connection won't work and it throws:
<urlopen error [WinError 10061] No connection could be made because the target machine actively refused it> (client - server)
So I started separately to work out how to make a connection work, and it does (it returns "Connected"):
import urllib.request
from urllib.error import HTTPError, URLError

proxy = "user:pw@proxy:port"
proxies = {"http": "http://%s" % proxy}
url = "http://www.google.com/search?q=test"
headers = {'User-agent': 'Mozilla/5.0'}
try:
    proxy_support = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPHandler(debuglevel=1))
    urllib.request.install_opener(opener)
    req = urllib.request.Request(url, None, headers)
    html = urllib.request.urlopen(req).read()
    #print(html)
    print("Connected")
except (HTTPError, URLError) as err:
    print("No internet connection.")
Now I have figured out how to access the news and how to make a connection via the VPN, but I can't bring the two together. I want to grab the news via the library through the VPN. I am fairly new to Python, so I guess I don't fully get the logic yet.
EDIT: I tried to combine it with feedparser, based on furas' hint:
import urllib.request
from urllib.error import HTTPError, URLError
import feedparser

proxy = "user:pw@proxy:port"
proxies = {"http": "http://%s" % proxy}
#url = "http://www.google.com/search?q=test"
#url = "http://www.reddit.com/r/python/.rss"
url = "https://timesofindia.indiatimes.com/rssfeedstopstories.cms"
headers = {'User-agent': 'Mozilla/5.0'}
try:
    proxy_support = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPHandler(debuglevel=1))
    urllib.request.install_opener(opener)
    req = urllib.request.Request(url, None, headers)
    html = urllib.request.urlopen(req).read()
    #print(html)
    #print("Connected")
    feed = feedparser.parse(html)
    #print(feed['feed']['link'])
    print("Number of RSS posts :", len(feed.entries))
    entry = feed.entries[1]
    print("Post Title :", entry.title)
except (HTTPError, URLError) as err:
    print("No internet connection.")
But I get the same error... this is a hard nut to crack.
May I ask for your advice? Thank you :)
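One observation worth adding here, as a sketch under assumptions: the Times of India feed URL is https, but the proxies dict above only maps the http scheme, so the opener bypasses the proxy for that request. Also, libraries that do their own fetching (FinNews appears to wrap feedparser) generally honour the standard proxy environment variables, which both urllib and requests read, so setting them before use may be enough (credentials and address are placeholders):

import os

# placeholder proxy credentials/host; set these before the library fetches anything
os.environ['HTTP_PROXY'] = 'http://user:pw@proxy:port'
os.environ['HTTPS_PROXY'] = 'http://user:pw@proxy:port'

import FinNews as fn

cnbc_feed = fn.CNBC(topics=['finance', 'earnings'])
print(cnbc_feed.get_news())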
I was wondering whether my requests are being blocked by the website and whether I need to set a proxy. I first tried to close the HTTP connection, but I failed. I also tried to test my code, but now it seems to produce no output at all. Maybe if I use a proxy everything will be OK?
Here is the code.
import requests
from urllib.parse import urlencode
import json
from bs4 import BeautifulSoup
import re
from html.parser import HTMLParser
from multiprocessing import Pool
from requests.exceptions import RequestException
import time
def get_page_index(offset, keyword):
    #headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1
    }
    url = 'http://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url, headers={'Connection': 'close'})
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException as e:
        print(e)

def parse_page_index(html):
    data = json.loads(html)
    if data and 'data' in data.keys():
        for item in data.get('data'):
            url = item.get('article_url')
            if url and len(url) < 100:
                yield url

def get_page_detail(url):
    try:
        response = requests.get(url, headers={'Connection': 'close'})
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException as e:
        print(e)

def parse_page_detail(html):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    pattern = re.compile(r'articleInfo: (.*?)},', re.S)
    pattern_abstract = re.compile(r'abstract: (.*?)\.', re.S)
    res = re.search(pattern, html)
    res_abstract = re.search(pattern_abstract, html)
    if res and res_abstract:
        data = res.group(1).replace(r".replace(/<br \/>|\n|\r/ig, '')", "") + '}'
        abstract = res_abstract.group(1).replace(r"'", "")
        content = re.search(r'content: (.*?),', data).group(1)
        source = re.search(r'source: (.*?),', data).group(1)
        time_pattern = re.compile(r'time: (.*?)}', re.S)
        date = re.search(time_pattern, data).group(1)
        date_today = time.strftime('%Y-%m-%d')
        img = re.findall(r'src="(.*?)"', content)
        if date[1:11] == date_today and len(content) > 50 and img:
            return {
                'title': title,
                'content': content,
                'source': source,
                'date': date,
                'abstract': abstract,
                'img': img[0]
            }

def main(offset):
    flag = 1
    html = get_page_index(offset, '光伏')
    for url in parse_page_index(html):
        html = get_page_detail(url)
        if html:
            data = parse_page_detail(html)
            if data:
                html_parser = HTMLParser()
                cwl = html_parser.unescape(data.get('content'))
                data['content'] = cwl
                print(data)
                print(data.get('img'))
                flag += 1
                if flag == 5:
                    break

if __name__ == '__main__':
    pool = Pool()
    pool.map(main, [i*20 for i in range(10)])
and the error is here:
HTTPConnectionPool(host='tech.jinghua.cn', port=80): Max retries exceeded with url: /zixun/20160720/f191549.shtml (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x00000000048523C8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
By the way, when I first tested my code, it showed that everything was OK!
Thanks in advance!
It seems to me you're hitting the connection limit of the HTTPConnectionPool, since you start 10 workers at the same time.
Try one of the following (a combined sketch follows this list):
Increase the request timeout (in seconds): requests.get('url', timeout=5)
Close the response: Response.close(). Instead of returning response.text, assign the response to a variable, close the response, and then return the variable.
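A minimal sketch of both suggestions applied to the asker's get_page_detail (the 5-second timeout is an arbitrary choice, not from the original post):

import requests
from requests.exceptions import RequestException

def get_page_detail(url):
    try:
        response = requests.get(url, timeout=5, headers={'Connection': 'close'})
        response.encoding = 'utf-8'
        status = response.status_code
        text = response.text  # read the body before closing
        response.close()      # explicitly release the connection back to the pool
        if status == 200:
            return text
        return None
    except RequestException as e:
        print(e)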
When I faced this issue I had the following problems:
- The requests Python module was unable to get information from any URL, although I was able to browse the site in a browser and could download the same page with wget or curl.
- pip install was also failing with the following error:
Failed to establish a new connection: [Errno 11004] getaddrinfo failed
A certain site had blocked me, so I tried forcebindip to route my Python modules through another network interface, and then I removed it. That probably messed up my network: the requests module, and even the raw socket module, were stuck and unable to fetch any URL.
So I followed the network configuration reset described at the link below, and now I am good.
network configuration reset
In case it helps someone else, I faced this same error message:
Client-Request-ID=long-string Retry policy did not allow for a retry: , HTTP status code=Unknown, Exception=HTTPSConnectionPool(host='table.table.core.windows.net', port=443): Max retries exceeded with url: /service(PartitionKey='requests',RowKey='9999') (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001D920ADA970>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')).
...when trying to retrieve a record from Azure Table Storage using
table_service.get_entity(table_name, partition_key, row_key).
My issue:
I had the table_name incorrectly defined.
The URL I had constructed was also malformed: after ".com" there was no slash, so another part of the URL was fused directly onto it.
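For reference, a minimal sketch of the working call once those two are fixed, assuming the older azure-cosmosdb-table package (account name, key and entity keys are placeholders):

from azure.cosmosdb.table.tableservice import TableService

# the storage host name is derived from account_name, so a malformed
# account URL fails DNS resolution with exactly this getaddrinfo error
table_service = TableService(account_name='myaccount', account_key='mykey')
entity = table_service.get_entity('mytable', 'requests', '9999')
print(entity)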
Sometimes it's due to a VPN connection. I had the same problem: I wasn't even able to install the requests package via pip. I turned off my VPN and voilà, I managed to install it and also to make requests. The [Errno 11004] error was gone.
I have followed this tutorial but I still fail to get any output. Below is my code in view.py:
import urllib.request
from bs4 import BeautifulSoup
from django.shortcuts import render

def index(request):
    #html = "a"
    #url = requests.get("https://www.python.org/")
    #page = urllib.request.urlopen(url)
    #soup = BeautifulSoup(page.read())
    #soup = url.content
    #urllib3.disable_warnings()
    #requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
    #url = url.content
    #default_headers = make_headers(basic_auth='myusername:mypassword')
    #http = ProxyManager("https://myproxy.com:8080/", headers=default_headers)
    r = urllib.request.urlopen('http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts').read()
    soup = BeautifulSoup(r, 'html.parser')
    url = type(soup)
    context = {"result": url}
    return render(request, 'index.html', context)
Output:
urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
If you are sitting behind a firewall or similar you might have to specify a proxy for the request to get through.
See below example using the requests library.
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
r = requests.get('http://example.org', proxies=proxies)
I am trying to download the DOM from the Yahoo fantasy football page. The data requires a Yahoo user, so I am looking for the Python library that lets me add my user/pass to the request.
urllib has HTTPBasicAuthHandler, which needs an HTTPPasswordMgr object.
add_password says I am missing an argument even when I pass it the four it wants, and I am not sure what to put for the realm. I am new to Python.
I have found Requests, which looks promising, but when I install it, it throws an error and I cannot import it properly :\
I was hoping this would be a bit easier to do in Python!
import urllib.request

try:
    url = "http://football.fantasysports.yahoo.com/"
    username = "un"
    password = "pw"
    pwObj = urllib.request.HTTPPasswordMgr.add_password("http://yahoo.com", url, username, password)
    request = urllib.request.HTTPBasicAuthHandler(pwObj)
    result = urllib.request.urlopen(request)
    print(result.read())
except Exception as e:
    print(str(e))
# Error: add_password() missing 1 required positional argument: 'passwd'
The ideal solution would have someone downloading yahoo DOM data from a page that requires credentials :)
Like this:
import urllib.request

try:
    url = "http://football.fantasysports.yahoo.com/"
    username = "_username"
    password = "_password"
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    opener = urllib.request.build_opener(handler)
    opener.open("http://football.fantasysports.yahoo.com/f1/leaderboard")
    urllib.request.install_opener(opener)
    result = urllib.request.urlopen(url)
    print(result.read())
except Exception as e:
    print(str(e))
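Note the two fixes relative to the question: add_password is called on a password-manager instance rather than on the HTTPPasswordMgr class itself (calling it on the class is what produced the "missing 1 required positional argument: 'passwd'" error, because the first argument was consumed as self), and HTTPPasswordMgrWithDefaultRealm with None as the realm sidesteps the realm question entirely.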
But how do you run JavaScript events?