I have a problem scraping data from the Seeking Alpha website. I know this question has been asked several times, but the solutions provided didn't help.
I have the following block of code:
import urllib.request
from bs4 import BeautifulSoup

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

def scrape_news(url, source):
    opener = AppURLopener()
    if source == 'SeekingAlpha':
        print(url)
        with opener.open(url) as response:
            s = response.read()
            data = BeautifulSoup(s, "lxml")
            print(data)

scrape_news('https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer', 'SeekingAlpha')
Any idea what might be going wrong here?
EDIT:
whole traceback:
Traceback (most recent call last):
File ".\news.py", line 107, in <module>
scrape_news('https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer','SeekingAlpha')
File ".\news.py", line 83, in scrape_news
with opener.open(url) as response:
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\urllib\response.py", line 30, in __enter__
raise ValueError("I/O operation on closed file")
ValueError: I/O operation on closed file
Your URL returns a 403. Try this in your terminal to confirm:
curl -s -o /dev/null -w "%{http_code}" https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer
Or, try this in your Python repl:
import urllib.request
url = 'https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer'
opener = urllib.request.FancyURLopener()
response = opener.open(url)
print(response.getcode())
FancyURLopener swallows errors about the failure response code, which is why your code continues on to response.read() instead of exiting, even though it never received a valid response. The standard urllib.request.urlopen handles this for you by raising an exception on a 403 error; otherwise you can handle it yourself.
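A minimal sketch of that approach, using only urllib (the User-Agent string mirrors the one the original class sets; the function name is made up):

```python
import urllib.request
import urllib.error

def fetch(url):
    # Plain urlopen raises HTTPError on a 403 instead of handing back a
    # closed response the way FancyURLopener does.
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urllib.request.urlopen(req) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        print('request failed with HTTP', e.code)
        raise
```

Note that the site may still refuse script-like clients regardless of the header; the point is that the failure now surfaces as an explicit exception rather than a confusing "I/O operation on closed file".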
Related
I'm using Python 3 to scrape the 63,067 webpages whose source URLs are stored in a Pandas data frame I've created from a csv file. The for-loop is supposed to scrape news articles for a project, to be placed into giant text files for cleaning later on.
I'm a bit rusty with Python and this project is the reason I've started programming in it again. I haven't used BeautifulSoup before, so I'm having some difficulty and just got the for-loop to work on the Pandas data frame with BeautifulSoup.
This is for one of the three data sets I'm using (the other two are programmed into the code below to repeat the same process for different data sets, which is why I'm mentioning this).
from bs4 import BeautifulSoup as BS
import requests, csv
import pandas as pd

negativedata = pd.read_csv('negativedata.csv')
positivedata = pd.read_csv('positivedata.csv')
neutraldata = pd.read_csv('neutraldata.csv')

negativedf = pd.DataFrame(negativedata)
positivedf = pd.DataFrame(positivedata)
neutraldf = pd.DataFrame(neutraldata)

negativeURLS = negativedf[['sourceURL']]
for link in negativeURLS.iterrows():
    url = link[1]['sourceURL']
    negative = requests.get(url)
    negative_content = negative.text
    negativesoup = BS(negative_content, "lxml")
    for text in negativesoup.find_all('a', href=True):
        text.append(text.get('href'))
I think I finally got my for-loop to run through all of the source URLs. However, I then get this error:
Traceback (most recent call last):
File "./datacollection.py", line 18, in <module>
negative = requests.get(url)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in send
history = [resp for resp in gen] if allow_redirects else []
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in <listcomp>
history = [resp for resp in gen] if allow_redirects else []
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 140, in resolve_redirects
raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
I know that the issue arises when I'm requesting the URLs, but I'm not sure which URL (or whether a single URL) is the problem, given how many webpages the data frame iterates through. Is the problem one bad URL, or do I simply have too many and should use a different package like scrapy?
I would suggest using a module like mechanize for scraping. Mechanize has a way of handling robots.txt and is much better if your application scrapes data from URLs of different websites. But in your case, the redirect is probably caused by not having a User-Agent in the headers, as mentioned here (https://github.com/requests/requests/issues/3596). And here's how you set headers with requests (Sending "User-agent" using Requests library in Python).
P.S.: mechanize is only available for Python 2.x. If you wish to use Python 3.x, there are other options (Installing mechanize for python 3.4).
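For the requests route, a sketch of a session that sends the header on every request (the exact User-Agent string is an assumption; any realistic browser string should do):

```python
import requests

def make_session():
    # A Session reuses connections and carries these headers on every
    # request, so sites that redirect script-like clients behave normally.
    session = requests.Session()
    session.headers.update(
        {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
    return session
```

A call like make_session().get(url, timeout=10) can then be wrapped in a try/except for requests.exceptions.TooManyRedirects, so one bad URL is logged and skipped instead of aborting the whole 63,067-page run.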
When I use urllib in Python3 to get the HTML code of a web page, I use this code:
from urllib.request import Request, urlopen

def getHTML(url):
    request = Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0')
    html = urlopen(request).read().decode('utf-8')
    print(html)
    return html
However, this fails every time with the error:
Traceback (most recent call last):
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
getHTML('https://www.hltv.org/team/7900/spirit-academy')
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The page is in UTF-8 and I am decoding it properly according to the urllib docs. The page is not gzipped or in another charset from what I can tell.
url.info().get_charset() returns None for the page, however the meta tags specify UTF-8. I have no problems viewing the HTML in any program.
I do not want to use any external libraries.
Is there a solution? What is going on? This works fine with the following Python 2 code:

import urllib2

def getHTML(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open(url)
    html = response.read()
    return html
You don't need to decode('utf-8').

The following should return the fetched HTML:

def getHTML(url):
    request = Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0')
    html = urlopen(request).read()
    return html
There, found your error: the parsing was done just fine and everything evaluated alright. But read the traceback carefully:
Traceback (most recent call last):
  File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
    getHTML('https://www.hltv.org/team/7900/spirit-academy')
  File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
    print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The error was caused by the print statement; as you can see, print(html) is right there in the traceback.
This is a somewhat common exception. It's just telling you that, with your current system encoding, some of the text cannot be printed to the console. One simple solution is to use print(html.encode('ascii', 'ignore')) to ignore all the unprintable characters. You can still do everything else with html; you just can't print it.
See this if you want a better "fix": https://wiki.python.org/moin/PrintFails
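A quick illustration of that workaround with a made-up snippet (the accented characters stand in for whatever the page contains):

```python
html = '<p>naïve café</p>'  # sample text with characters ASCII can't represent

# encode(..., 'ignore') silently drops the offending characters, so the
# result is always printable; 'replace' would substitute '?' instead.
safe = html.encode('ascii', 'ignore')
print(safe)  # b'<p>nave caf</p>'
```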
btw: the re module can search byte strings, so you can still work with the undecoded bytes. Copy this exactly as-is and it will work:
import re
print(re.findall(b'hello', b'hello world'))
I am unable to download an xls file from a URL. I have tried with both urlopen and urlretrieve, but I receive a really long error message starting with:
Traceback (most recent call last):
File "C:/Users/Henrik/Documents/Development/Python/Projects/ImportFromWeb.py", line 6, in
f = ur.urlopen(dls)
File "C:\Users\Henrik\AppData\Local\Programs\Python\Python35\lib\urllib\request.py", line 163, in urlopen
return opener.open(url, data, timeout)
and ending with:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
Unfortunately I can't provide the URL I am using since the data is sensitive. However, I will give you the URL with some parts removed:
https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504
As you can see, the URL doesn't end with "/file.xls", for example. I don't know if that matters, but most of the threads regarding this issue involved those types of links.
If I enter the URL in my address bar, the file download window appears:
Image of download window
The code I have written look like this:
import urllib.request as ur
import openpyxl as pyxl
dls = 'https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504'
f = ur.urlopen(dls)
I am grateful for any help you can provide!
I wrote a little Python function that uses the BeautifulSoup library. The function should return a list of symbols from a Wikipedia article.
In the Shell, I execute the function like this:
pythonScript.my_function()
...it throws an error at line 28:
No connection adapters were found for 'link'.
When I type the same code from my function directly into the shell, it works perfectly, with the same link. I have even copied and pasted the lines.
These are the two lines of code I'm talking about; the error appears at the BeautifulSoup call.
response = requests.get('link')
soup = bs4.BeautifulSoup(response.text)
I can't explain why this error happens...
EDIT: here is the full code
#!/usr/bin/python
# -*- coding: utf-8 -*-
# insert_symbols.py

from __future__ import print_function

import datetime
from math import ceil

import bs4
import MySQLdb as mdb
import requests

def obtain_parse_wiki_snp500():
    '''
    Download and parse the Wikipedia list of S&P500
    constituents using requests and BeautifulSoup.
    Returns a list of tuples to add to MySQL.
    '''
    # Stores the current time, for the created_at record
    now = datetime.datetime.utcnow()
    # Use requests and BeautifulSoup to download the
    # list of S&P500 companies and obtain the symbol table
    response = requests.get('http://en.wikipedia.org/wiki/list_of_S%26P_500_companies')
    soup = bs4.BeautifulSoup(response.text)
I think that must be enough; this is the point where the error occurs.
In the shell I've done everything step by step: importing the libraries and then calling the requests and bs4 functions.
The single difference is that in the shell I didn't define a function for it.
EDIT2:
Here is the exact Error Message:
Traceback (most recent call last):
File "", line 1, in
File "/home/felix/Dokumente/PythonSAT/DownloadSymbolsFromWikipedia.py", line 28, in obtain_parse_wiki_snp500
soup = bs4.BeautifulSoup(response.text)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/api.py", line 67, in get
return request('get', url, params=params, **kwargs)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/api.py", line 53, in request
return session.request(method=method, url=url, **kwargs)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/sessions.py", line 570, in send
adapter = self.get_adapter(url=request.url)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/sessions.py", line 644, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'htttp://en.wikipedia.org/wiki/list_of_S%26P_500_companies'
You have passed a link with htttp, i.e. you have three t's when there should be just two; that code would error wherever you run it. The correct URL is:
http://en.wikipedia.org/wiki/list_of_S%26P_500_companies
Once you fix the url the request works fine:
In [1]: import requests

In [2]: requests.get('http://en.wikipedia.org/wiki/list_of_S%26P_500_companies')
Out[2]: <Response [200]>
To get the symbols:
In [28]: for tr in soup.select("table.wikitable.sortable tr + tr"):
   ....:     sym = tr.select_one("a.external.text")
   ....:     if sym:
   ....:         print(sym.text)
   ....:
MMM
ABT
ABBV
ACN
ATVI
AYI
ADBE
AAP
AES
AET
AFL
AMG
A
GAS
APD
AKAM
ALK
AA
AGN
ALXN
ALLE
ADS
ALL
GOOGL
GOOG
MO
AMZN
AEE
AAL
AEP
and so on ...............
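The selector logic can be sanity-checked against a minimal inline table (a sketch with hypothetical rows; the live page has far more rows and columns):

```python
import bs4

# A cut-down stand-in for the Wikipedia constituents table:
sample = """
<table class="wikitable sortable">
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td><a class="external text" href="#">MMM</a></td><td>3M</td></tr>
  <tr><td><a class="external text" href="#">ABT</a></td><td>Abbott</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(sample, 'html.parser')
symbols = []
# 'tr + tr' matches every row that follows another row, i.e. it skips the header
for tr in soup.select('table.wikitable.sortable tr + tr'):
    sym = tr.select_one('a.external.text')
    if sym:
        symbols.append(sym.text)
print(symbols)  # ['MMM', 'ABT']
```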
I'm working on a scraper using Mechanize and Beautiful Soup in Python and for some reason redirects aren't working. Here's my code (I apologize for naming my variables "thing" and "stuff"; I don't normally do that, trust me):
stuff = soup.find('div', attrs={'class': 'paging'}).ul.findAll('a', href=True)
for thing in stuff:
    pageUrl = thing['href']
    print pageUrl
    req = mechanize.Request(pageUrl)
    response = browser.open(req)
    searchPage = response.read()
    soup = BeautifulSoup(searchPage)
    soupString = soup.prettify()
    print soupString
Anyway, products on Kraft's website that have more than one page of search results display a link to go to the next page(s). The source code lists, for example, this as the next page for Kraft's line of steak sauces and marinades, which redirects to this.
Anyway, thing['href'] has the old link in it because it scrapes the web page for it; one would think that calling browser.open() on that link would cause mechanize to go to the new link and return that as a response. However, running the code gives this result:
http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2
Traceback (most recent call last):
File "C:\Development\eclipse\mobile development\Crawler\src\Kraft.py", line 58, in <module>
response = browser.open(req)
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 203, in open
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 255, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 408: Request Time-out
I get a time-out; I imagine it's because, for some reason, mechanize is looking for the old URL and isn't being redirected to the new one (I also tried this with urllib2 and received the same result). What's going on here?
Thanks for the help and let me know if you need any more information.
Update: Alright, I enabled logging; now my code reads:
req = mechanize.Request(pageUrl)
print logging.INFO
When I run it I get this:
url argument is not a URI (contains illegal characters) u'http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2'
20
Update 2 (which occurred while writing the first update): It turns out that it was the spaces in my string! All I had to do was this: pageUrl = thing['href'].replace(' ', '+') and it works perfectly.
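A more general fix than hand-replacing spaces is to percent-encode the URL. A sketch in Python 3 (the question's code is Python 2, where urllib.quote is the rough equivalent):

```python
from urllib.parse import quote

pageUrl = ('http://www.kraftrecipes.com/products/pages/'
           'productinfosearchresults.aspx?catalogtype=1&brandid=1'
           '&searchtext=a.1. steak sauces and marinades&pageno=2')

# Encode the illegal characters (the spaces become %20) while leaving the
# scheme, path separators and query syntax alone:
safe = quote(pageUrl, safe=':/?&=')
print(safe)
```

This handles any character that is illegal in a URL, not just spaces, so it keeps working if a product name with other punctuation shows up.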
Both urllib2 and mechanize openers include a handler for redirect responses by default (you can check by looking at the handlers attribute), so I don't think the problem is that a redirect response isn't being correctly followed.
To troubleshoot the problem, you should capture the traffic in your web browser (in Firefox, Live HTTP Headers and HttpFox are useful for this) and compare it with the logs from your script. I'd recommend subclassing urllib2.BaseHandler to create your own handler that logs all the information you need for every request, and adding it to your opener object using the add_handler method.
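In Python 3 the same idea lives in urllib.request; a sketch of such a logging handler (class and naming are made up, and the print could just as well be a logging call):

```python
import urllib.request

class LoggingHandler(urllib.request.BaseHandler):
    # Low handler_order runs this before the protocol handlers,
    # so every outgoing request passes through it first.
    handler_order = 100

    def http_request(self, req):
        print('>>', req.get_method(), req.full_url)
        return req  # request handlers must return the (possibly modified) request

    https_request = http_request  # same logging for https URLs

# Attach it alongside the default handlers (redirects included):
opener = urllib.request.build_opener(LoggingHandler())
```

Every request the opener makes, including the intermediate hops of a redirect chain, will then show up in the output, which makes it easy to compare against what the browser does.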