I wrote a little Python function that uses the BeautifulSoup library. The function should return a list of symbols from a Wikipedia article.
In the Shell, I execute the function like this:
pythonScript.my_function()
...it throws an error in line 28:
No connection adapters were found for 'link'.
When I type the same code from my function directly into the shell, it works perfectly, with the same link. I have even copied and pasted the lines.
These are the two lines of code I'm talking about; the error appears with the BeautifulSoup function.
response = requests.get('link')
soup = bs4.BeautifulSoup(response.text)
I can't explain why this error happens...
EDIT: here is the full code
#!/usr/bin/python
# -*- coding: utf-8 -*-
#insert_symbols.py
from __future__ import print_function
import datetime
from math import ceil
import bs4
import MySQLdb as mdb
import requests
def obtain_parse_wiki_snp500():
    '''
    Download and parse the Wikipedia list of S&P500
    constituents using requests and BeautifulSoup.
    Returns a list of tuples to add to MySQL.
    '''
    # Stores the current time, for the created_at record
    now = datetime.datetime.utcnow()
    # Use requests and BeautifulSoup to download the
    # list of S&P500 companies and obtain the symbol table
    response = requests.get('http://en.wikipedia.org/wiki/list_of_S%26P_500_companies')
    soup = bs4.BeautifulSoup(response.text)
I think that must be enough. This is the point where the error occurs.
In the shell I've done everything step by step:
importing the libraries and then calling the requests and bs4 functions.
The only difference is that in the shell I didn't define a function for it.
EDIT2:
Here is the exact Error Message:
Traceback (most recent call last):
File "", line 1, in
File "/home/felix/Dokumente/PythonSAT/DownloadSymbolsFromWikipedia.py", line 28, in obtain_parse_wiki_snp500
soup = bs4.BeautifulSoup(response.text)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/api.py", line 67, in get
return request('get', url, params=params, **kwargs)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/api.py", line 53, in request
return session.request(method=method, url=url, **kwargs)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/sessions.py", line 570, in send
adapter = self.get_adapter(url=request.url)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/sessions.py", line 644, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'htttp://en.wikipedia.org/wiki/list_of_S%26P_500_companies'
You have passed a link with htttp, i.e. you have three t's where there should be just two; that code would error wherever you run it. The correct URL is:
http://en.wikipedia.org/wiki/list_of_S%26P_500_companies
Once you fix the url the request works fine:
In [1]: import requests

In [2]: requests.get('http://en.wikipedia.org/wiki/list_of_S%26P_500_companies')
Out[2]: <Response [200]>
To get the symbols:
In [28]: for tr in soup.select("table.wikitable.sortable tr + tr"):
   ....:     sym = tr.select_one("a.external.text")
   ....:     if sym:
   ....:         print(sym.text)
   ....:
MMM
ABT
ABBV
ACN
ATVI
AYI
ADBE
AAP
AES
AET
AFL
AMG
A
GAS
APD
AKAM
ALK
AA
AGN
ALXN
ALLE
ADS
ALL
GOOGL
GOOG
MO
AMZN
AEE
AAL
AEP
and so on ...............
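Putting the answer together with the docstring from the question, a minimal sketch of the fixed function might look like the following. This is only a sketch: the table selector is the one used above and may break if Wikipedia changes the page layout, and the (symbol, created_at) tuple shape is an assumption based on the docstring rather than something stated in the question.
import datetime
import bs4
import requests

def obtain_parse_wiki_snp500():
    '''
    Download and parse the Wikipedia list of S&P500 constituents
    and return a list of (symbol, created_at) tuples.
    '''
    now = datetime.datetime.utcnow()
    # Note "http", not "htttp" -- the extra "t" was the cause of the InvalidSchema error
    response = requests.get('http://en.wikipedia.org/wiki/list_of_S%26P_500_companies')
    soup = bs4.BeautifulSoup(response.text, 'html.parser')

    symbols = []
    # Skip the header row, then read the ticker from the external link in each row
    for tr in soup.select("table.wikitable.sortable tr + tr"):
        sym = tr.select_one("a.external.text")
        if sym:
            symbols.append((sym.text, now))
    return symbols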
Related
I have a problem scraping data from the Seeking Alpha website. I know this question has been asked several times already, but the solutions provided didn't help.
I have the following block of code:
import urllib.request
from bs4 import BeautifulSoup

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

def scrape_news(url, source):
    opener = AppURLopener()
    if source == 'SeekingAlpha':
        print(url)
        with opener.open(url) as response:
            s = response.read()
            data = BeautifulSoup(s, "lxml")
            print(data)

scrape_news('https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer','SeekingAlpha')
Any idea what might be going wrong here?
EDIT:
whole traceback:
Traceback (most recent call last):
File ".\news.py", line 107, in <module>
scrape_news('https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer','SeekingAlpha')
File ".\news.py", line 83, in scrape_news
with opener.open(url) as response:
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\urllib\response.py", line 30, in __enter__
raise ValueError("I/O operation on closed file")
ValueError: I/O operation on closed file
Your URL returns a 403. Try this in your terminal to confirm:
curl -s -o /dev/null -w "%{http_code}" https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer
Or, try this in your Python repl:
import urllib.request
url = 'https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer'
opener = urllib.request.FancyURLopener()
response = opener.open(url)
print(response.getcode())
FancyURLopener is swallowing any errors about the failure response code, which is why your code continues to the response.read() instead of exiting, even though it hasn't received a valid response. The standard urllib.request.urlopen should handle this for you by raising an exception on a 403 error; otherwise you can handle it yourself.
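For illustration, here is a rough sketch of scrape_news rewritten around urllib.request.urlopen, so a 403 surfaces as an exception instead of a closed file. The browser-style User-Agent header is an assumption (the default urllib one is often rejected), and whether Seeking Alpha accepts the request at all is a separate question.
import urllib.request
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def scrape_news(url, source):
    if source == 'SeekingAlpha':
        # Send a browser-like User-Agent with the request
        req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            with urllib.request.urlopen(req) as response:
                s = response.read()
        except HTTPError as e:
            # urlopen raises on 4xx/5xx instead of silently returning a closed file
            print('Request failed with status', e.code)
            return None
        return BeautifulSoup(s, "lxml")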
I am using the following code in an attempt to do web scraping.
import sys, os
import requests, webbrowser, bs4
from PIL import Image
import pyautogui

p = requests.get('http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1')

n = open("exml.txt", 'wb')
for i in p.iter_content(1000):
    n.write(i)
n.close()

n = open("exml.txt", 'r')
soupy = bs4.BeautifulSoup(n, "html.parser")
elems = soupy.select('img[src]')
for u in elems:
    print(u)
So what I am intending to do is extract all the image links that are in the HTML response obtained from the page.
(Please correct me if I am wrong in thinking that requests.get returns the whole static HTML of the webpage that opens on entering the URL.)
However, in the line:
soupy = bs4.BeautifulSoup(n, "html.parser")
I am getting the following error:
Traceback (most recent call last):
File "../../perl/webscratcher.txt", line 24, in <module>
soupy= bs4.BeautifulSoup(n,"html.parser")
File "C:\Users\Kanishc\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 191, in __init__
markup = markup.read()
File "C:\Users\Kanishc\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 24662: character maps to <undefined>
I am clueless about the error, and the "AppData" folder is empty.
How do I proceed further?
After trying the suggestions:
I changed the extension of the file to .py and that error went away. However, on the following line:
soupy = bs4.BeautifulSoup(p, "lxml")
I am getting the following error:
Traceback (most recent call last):
File "C:\perl\webscratcher.py", line 23, in
soupy= bs4.BeautifulSoup(p,"lxml")
File "C:\Users\PREMRAJ\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4_init_.py", line 192, in init
elif len(markup) <= 256 and (
TypeError: object of type 'Response' has no len()
How do I tackle this?
You are over-complicating things. Pass the bytes content of a Response object directly into the constructor of the BeautifulSoup object, instead of writing it to a file.
import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1')
soup = BeautifulSoup(response.content, 'lxml')
for element in soup.select('img[src]'):
    print(element)
Okay, so you might want to review how BeautifulSoup works. I referenced an old project of mine, and this is all you need for printing them. Check the BS4 docs to find the exact syntax you want with the select method.
This will print all the img tags from the HTML:
import requests, bs4
site = 'http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1'
p = requests.get(site).text
soupy = bs4.BeautifulSoup(p,"html.parser")
elems = soupy.select('img[src]')
for u in elems:
    print(u)
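If you only want the link targets rather than the whole tags, you can read the src attribute off each element; this is just a small variation of the snippet above, not a different approach.
import requests, bs4

site = 'http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1'
soupy = bs4.BeautifulSoup(requests.get(site).text, "html.parser")
for u in soupy.select('img[src]'):
    print(u['src'])  # just the URL from the src attribute, not the whole <img> tag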
I'm using Python 3 to scrape webpages listed in a Pandas data frame I've created from a CSV file containing the source URLs of 63,067 webpages. The for-loop is supposed to scrape the news articles for a project and place them into giant text files for cleaning later on.
I'm a bit rusty with Python and this project is the reason I've started programming in it again. I haven't used BeautifulSoup before, so I'm having some difficulty and just got the for-loop to work on the Pandas data frame with BeautifulSoup.
This is for one of the three data sets I'm using (the other two are programmed into the code below to repeat the same process for different data sets, which is why I'm mentioning this).
from bs4 import BeautifulSoup as BS
import requests, csv
import pandas as pd
negativedata = pd.read_csv('negativedata.csv')
positivedata = pd.read_csv('positivedata.csv')
neutraldata = pd.read_csv('neutraldata.csv')
negativedf = pd.DataFrame(negativedata)
positivedf = pd.DataFrame(positivedata)
neutraldf = pd.DataFrame(neutraldata)
negativeURLS = negativedf[['sourceURL']]
for link in negativeURLS.iterrows():
    url = link[1]['sourceURL']
    negative = requests.get(url)
    negative_content = negative.text
    negativesoup = BS(negative_content, "lxml")
    for text in negativesoup.find_all('a', href=True):
        text.append((text.get('href')))
I think I finally got my for-loop working so the code runs through all of the source URLs. However, I then get this error:
Traceback (most recent call last):
File "./datacollection.py", line 18, in <module>
negative = requests.get(url)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in send
history = [resp for resp in gen] if allow_redirects else []
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in <listcomp>
history = [resp for resp in gen] if allow_redirects else []
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 140, in resolve_redirects
raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
I know that the issue occurs when I'm requesting the URLs, but I'm not sure which URL is the problem, or whether a single URL is the problem at all, given the number of webpages in the data frame being iterated through. Is the problem one URL, or do I simply have too many and should use a different package like Scrapy?
I would suggest using modules like mechanize for scraping. Mechanize has a way of handling robots.txt and is much better if your application is scraping data from URLs of different websites. But in your case, the redirect is probably because of not having a User-Agent in the headers, as mentioned here (https://github.com/requests/requests/issues/3596). And here's how you set headers with requests (Sending "User-agent" using Requests library in Python).
P.S.: mechanize is only available for Python 2.x. If you wish to use Python 3.x, there are other options (Installing mechanize for Python 3.4).
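As a rough sketch of that suggestion applied to the loop in the question (the User-Agent string is only an example value, and skipping a URL on TooManyRedirects is one possible policy, not the only one):
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
for link in negativeURLS.iterrows():
    url = link[1]['sourceURL']
    try:
        negative = requests.get(url, headers=headers, timeout=10)
    except requests.exceptions.TooManyRedirects:
        # Record the offending URL and carry on instead of crashing the whole run
        print('Redirect loop for', url)
        continue
    negative_content = negative.text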
I am unable to download an xls file from a URL. I have tried both urlopen and urlretrieve, but I receive a really long error message starting with:
Traceback (most recent call last):
File "C:/Users/Henrik/Documents/Development/Python/Projects/ImportFromWeb.py", line 6, in
f = ur.urlopen(dls)
File "C:\Users\Henrik\AppData\Local\Programs\Python\Python35\lib\urllib\request.py", line 163, in urlopen
return opener.open(url, data, timeout)
and ending with:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
Unfortunately I can't provide the URL I am using since the data is sensitive. However, I will give you the URL with some parts removed.
https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504
As you can see, the URL doesn't end with "/file.xls", for example. I don't know if that matters, but most of the threads regarding this issue have had those types of links.
If I enter the URL in my address bar, the file download window appears:
Image of download window
The code I have written looks like this:
import urllib.request as ur
import openpyxl as pyxl
dls = 'https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504'
f = ur.urlopen(dls)
I am grateful for any help you can provide!
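Not an answer to the redirect loop itself, but for reference, a hedged sketch of how the download step could be written with requests once a working URL (and any login cookies the server may require) is available; the output file name is made up, and note that openpyxl only reads .xlsx files, so a genuine .xls export would need a different reader such as xlrd.
import requests

dls = 'https://...'  # the (anonymised) export URL from above, left elided here
response = requests.get(dls)
response.raise_for_status()  # raise on 4xx/5xx instead of silently saving an error page

# Save the raw bytes of the spreadsheet to disk
with open('transportinvoicelist.xls', 'wb') as f:
    f.write(response.content)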
This program is a very simple example of web scraping. The program's goal is to go on the internet, find a specific stock, and then tell the user the price that the stock is currently trading at. However, when I run it, this error message comes up:
Traceback (most recent call last):
File "HTMLwebscrape.py", line 15, in <module>
price = re.findall(pattern,htmltext)
File "C:\Python34\lib\re.py", line 210, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Below is the actual script of the program. I've tried finding ways to fix this code, but so far I've been unable to. I've been running this on Python 3 and using Sublime Text as my text editor. Thank you in advance!
import urllib
import re
from urllib.request import Request, urlopen
from urllib.error import URLError
symbolslist = ["AAPL","SPY","GOOG","NFLX"]
i=0
while i < len(symbolslist):
    url = Request("http://finance.yahoo.com/q?s=AAPL&ql=1")
    htmlfile = urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_184_' + symbolslist[i] + '"">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    print(price)
    i += 1
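The error itself comes from applying a str pattern to the bytes returned by read(); decoding the page first is one way around it. A minimal sketch of that single change inside the loop (assuming the page is UTF-8 encoded; whether the regex then matches the current Yahoo markup is a separate issue):
htmltext = htmlfile.read().decode('utf-8')  # bytes -> str, so a str pattern can be applied
price = re.findall(pattern, htmltext)
An alternative is to keep htmltext as bytes and compile a bytes pattern instead, e.g. re.compile(regex.encode()).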