Error when scraping in Python, need to bypass - python

import requests
from bs4 import BeautifulSoup
import csv
from urlparse import urljoin
import urllib2
outfile = open("./battingall.csv", "wb")
writer = csv.writer(outfile)
base_url = 'http://www.baseball-reference.com'
player_url = 'http://www.baseball-reference.com/players/'
alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
players = 'shtml'
gamel = '&t=b&year='
game_logs = 'http://www.baseball-reference.com/players/gl.cgi?id='
years = ['2015','2014','2013','2012','2011','2010','2009','2008']
drounders = []
for dround in alphabet:
    drounders.append(player_url + dround)
urlz = []
for ab in drounders:
    data = requests.get(ab)
    soup = BeautifulSoup(data.content)
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            urlz.append(base_url + link['href'])
yent = []
for ant in urlz:
    for d in drounders:
        for y in years:
            if players in ant:
                if len(ant) < 60:
                    if d in ant:
                        yent.append(game_logs + ant[44:-6] + gamel + y)
for j in yent:
    try:
        data = requests.get(j)
        soup = BeautifulSoup(data.content)
        table = soup.find('table', attrs={'id': 'batting_gamelogs'})
        tablea = j[52:59]
        tableb = soup.find("b", text='Throws:').next_sibling.strip()
        tablec = soup.find("b", text='Height:').next_sibling.strip()
        tabled = soup.find("b", text='Weight:').next_sibling.strip()
        list_of_rows = []
        for row in table.findAll('tr'):
            list_of_cells = []
            list_of_cells.append(tablea)
            list_of_cells.append(j[len(j)-4:])
            list_of_cells.append(tableb)
            list_of_cells.append(tablec)
            list_of_cells.append(tabled)
            for cell in row.findAll('td'):
                text = cell.text.replace(' ', '').encode("utf-8")
                list_of_cells.append(text)
            list_of_rows.append(list_of_cells)
        print list_of_rows
        writer.writerows(list_of_rows)
    except (AttributeError, NameError):
        pass
When I run this code to get gamelog batting data I keep getting an error:
Traceback (most recent call last):
File "battinggamelogs.py", line 44, in <module>
data = requests.get(j)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site- packages/requests/api.py", line 65, in get
return request('get', url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site- packages/requests/api.py", line 49, in request
response = session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 461, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))
I need a way to bypass this error so the loop keeps going. I think the error comes up because there is no table to get data from.

You can wrap your requests.get() block in a try/except. You need to catch the requests.exceptions.ConnectionError that is being generated.
for ab in drounders:
    try:
        data = requests.get(ab)
        soup = BeautifulSoup(data.content)
        for link in soup.find_all('a'):
            if link.has_attr('href'):
                urlz.append(base_url + link['href'])
    except requests.exceptions.ConnectionError:
        pass
This is occurring because the connection itself has a problem, not because there is no data in the table. You aren't even getting that far.
Note: This is completely eating the exception by simply using pass (as you are also doing later in the code block). It may be better to do something like this:
except requests.exceptions.ConnectionError:
    print("Failed to open {}".format(ab))
This will print a message to the console showing which URL failed.
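If you would rather retry flaky connections a few times before skipping them, one option is to mount an HTTPAdapter with a urllib3 Retry policy on a requests Session. This is a minimal sketch, not part of the original answer; the retry count and backoff factor are arbitrary choices:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[500, 502, 503, 504])  # also retry these transient HTTP errors
session.mount('http://', HTTPAdapter(max_retries=retries))

for ab in drounders:
    try:
        data = session.get(ab)  # retried automatically before the exception is raised
    except requests.exceptions.ConnectionError:
        print("Failed to open {} after retries".format(ab))
        continue
    soup = BeautifulSoup(data.content)
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            urlz.append(base_url + link['href'])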

Related

No Schema Supplied when Completing Loop through List of URLs from File

I'm working on a web scraping project with BeautifulSoup, and at one step I need to compile a list of links from another list of links that I have saved to a file. The loop seems to run fine until it gets to the last line of the file, at which point it throws requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?. Full code and traceback are below.
Does this have to do with the fact that Python is reading each row in my .txt file as a list? I also tried having only one for loop, like
for link in season_links:
    response_loop = requests.get(link[0])
But it didn't address the error.
Here is my code:
Contents of file:
https://rugby.statbunker.com/competitions/LastMatches?comp_id=98&limit=10&offs=UTC
https://rugby.statbunker.com/competitions/LastMatches?comp_id=99&limit=10&offs=UTC
# for reading season links from file
season_links = []
season_links_file = codecs.open('season_links_unpag_tst2.txt', 'r')
for line in season_links_file:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    season_links.append(line_list)
season_links_file.close()
print('Season links file read complete' + '\n')
print(season_links)
# handling for pagination within each season
for link in season_links:
    t0 = time.time()
    for item in link:  # for some reason it reads each row in my .txt as a list, so I have to loop over it again
        response_loop = requests.get(item)
        html_loop = response_loop.content
        soup_loop = BeautifulSoup(html_loop, 'html.parser')
        for p in soup_loop.find_all('p', text='›'):
            season_links.append(p.find_parent('a').get('href'))
        print('Season link: ' + item)
        response_delay = time.time() - t0
        print('Loop duration: ' + str(response_delay))
        time.sleep(4*response_delay)
        print('Sleep: ' + str(response_delay*4) + '\n')
Traceback
Season link: https://rugby.statbunker.com/competitions/LastMatches?comp_id=1&limit=10&offs=UTC
Loop duration: 2.961906909942627
Sleep: 11.847627639770508
Season link: https://rugby.statbunker.com/competitions/LastMatches?comp_id=103&limit=10&offs=UTC
Loop duration: 1.6234941482543945
Sleep: 6.493976593017578
Traceback (most recent call last):
File "/Users/claycrosby/Desktop/coding/projects/gambling/scraper/sb_compile_games.py", line 103, in <module>
response_loop = requests.get(item)
File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/sessions.py", line 516, in request
prep = self.prepare_request(req)
File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/sessions.py", line 449, in prepare_request
p.prepare(
File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/models.py", line 314, in prepare
self.prepare_url(url, params)
File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/models.py", line 388, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
[Finished in 23.3s with exit code 1]
EDIT: I have tried printing each item, and I find there is a third one that comes out as just h. There is no stray whitespace or h in my file, though.
The issue stemmed from the fact that I was appending to the original list from inside the loop that iterates over it: the appended entries are plain href strings rather than lists, so the inner for item in link loop walks them character by character, which is where the lone 'h' came from. Using separate lists, it processed without an error:
# for reading season links from file
season_links_unpag = []
season_links_file = codecs.open('season_links_unpag_tst2.txt', 'r')
for line in season_links_file:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    season_links_unpag.append(line_list)
season_links_file.close()
print('Season links file read complete' + '\n')
print(season_links_unpag)
# handling for pagination within each season
season_links = []
for link in season_links_unpag:
    t0 = time.time()
    for item in link:
        print(item)
        response_loop = requests.get(item)
        html_loop = response_loop.content
        soup_loop = BeautifulSoup(html_loop, 'html.parser')
        for p in soup_loop.find_all('p', text='›'):
            season_links.append(p.find_parent('a').get('href'))
        print('Season link: ' + item)
        response_delay = time.time() - t0
        print('Loop duration: ' + str(response_delay))
        time.sleep(4*response_delay)
        print('Sleep: ' + str(response_delay*4) + '\n')
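As a design note, another way to avoid appending to the list you are iterating is to treat the URLs as a work queue and pop from it until it is empty. This is a rough sketch rather than the original answer; it assumes the same file format and pagination markup, and that the pagination hrefs are absolute URLs:

import codecs
import requests
from bs4 import BeautifulSoup
from collections import deque

# seed the queue with the season links read from the file
queue = deque()
with codecs.open('season_links_unpag_tst2.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if line:
            queue.append(line)

season_links = []
while queue:
    url = queue.popleft()
    season_links.append(url)
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # enqueue pagination links found on this page; they will be visited in turn
    for p in soup.find_all('p', text='›'):
        queue.append(p.find_parent('a').get('href'))

Because the loop body only ever sees whole URL strings popped from the queue, there is no second level of iteration that can fall through to individual characters.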

How to do nested link scraping using BeautifulSoup and requests

I am trying to scrape the text of all of the episodes of all of the TV series on a webpage. The listing is nested, so the scraper goes through three pages before it reaches the list of episode links. It is showing the error I have pasted below.
import requests
import bs4 as bs
urls='http://dl5.lavinmovie.net/Series/'
url=requests.get(urls).text
soup=bs.BeautifulSoup(url,'lxml')
title=soup.find_all('a')
ur=[""]
names=[""]
season=[""]
quality=[""]
for i in title:
    # names.append(i.text)
    urlss=urls+i.text+"/"
    urla=requests.get(urls).text
    soupp=bs.BeautifulSoup(urla,'lxml')
    ur=soupp.find_all('a')
    for i in ur:
        # names.append(i.text)
        urls=urls+i.text+"/"
        urla=requests.get(urls).text
        soupp=bs.BeautifulSoup(urla,'lxml')
        ur=soupp.find_all('a')
        for i in ur:
            # quality.append(i.text)
            urls=urls+i.text+"/"
            urla=requests.get(urls).text
            soupp=bs.BeautifulSoup(urla,'lxml')
            ur=soupp.find_all('a')
            for i in ur:
                print(i.text)
Traceback (most recent call last):
File "C:\Users\Vedant Mamgain\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 603, in urlopen
chunked=chunked)
File "C:\Users\Vedant Mamgain\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "C:\Users\Vedant Mamgain\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\Vedant Mamgain\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1321, in getresponse
response.begin()
File "C:\Users\Vedant Mamgain\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 296, in begin
version, status, reason = self._read_status()
File "C:\Users\Vedant Mamgain\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 257, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "C:\Users\Vedant Mamgain\AppData\Local\Programs\Python\Python37\lib\socket.py", line 589, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
During handling of the above exception, another exception occurred:
Try using this, it worked for me:
import requests
import bs4 as bs

names = list()
name_links = list()
base_url = 'http://dl5.lavinmovie.net/Series/'
final_list = list()

soup = bs.BeautifulSoup(requests.get(base_url).text, 'lxml')
title = soup.find_all('a')
for link in title[1:]:
    names.append(link.text)
    current_link = link['href']
    print(link.text)
    name_links.append(str(current_link))
    # get seasons
    soup = bs.BeautifulSoup(requests.get(base_url + current_link).text, 'lxml')
    title = soup.find_all('a')
    for link in title[1:]:
        season_link = link['href']
        # get quality of the seasons
        soup = bs.BeautifulSoup(requests.get(base_url + current_link + season_link).text, 'lxml')
        title = soup.find_all('a')
        for link in title[1:]:
            quality_link = link['href']
            # get list of episodes
            soup = bs.BeautifulSoup(requests.get(base_url + current_link + season_link + quality_link).text, 'lxml')
            title = soup.find_all('a')
            for link in title[1:]:
                episode_link = link['href']
                final_list.append(episode_link)
Check if this works for you.
import requests
import bs4 as bs

urls = 'http://dl5.lavinmovie.net/Series/'
url = requests.get(urls).text
soup = bs.BeautifulSoup(url, 'lxml')
title = soup.find_all('a')
for i in title:
    if(i.text != '../' and ".mp4" not in i.text):
        urll = urls+i.text
        # arr.append(i.text)
        urll1 = requests.get(urll).text
        soupp1 = bs.BeautifulSoup(urll1, 'lxml')
        season = soupp1.find_all('a')
        print(i.text)
        for j in season:
            if(j.text != '../' and ".mp4" not in j.text):
                urlla = urll+j.text
                urll2 = requests.get(urlla).text
                soupp2 = bs.BeautifulSoup(urll2, 'lxml')
                quality = soupp2.find_all('a')
                print(j.text)
                for k in quality:
                    if(k.text != '../' and ".mp4" not in k.text):
                        urllb = urlla+k.text
                        urll3 = requests.get(urllb).text
                        soupp3 = bs.BeautifulSoup(urll3, 'lxml')
                        episode = soupp3.find_all('a')
                        print(k.text)
                        for m in episode:
                            if(m.text != '../' and ".mp4" not in m.text):
                                print(m.text)
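Since the original failure was a ConnectionResetError (the remote host dropping the connection), whichever version you use, it may also help to reuse a single requests.Session and pause briefly between requests so the directory server is hit less aggressively. This is a hedged sketch, not part of either answer; the delay value and User-Agent header are arbitrary:

import time
import requests
import bs4 as bs

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # some hosts drop connections from bare default clients

def get_links(url, delay=1.0):
    # fetch one directory page and return its anchor tags, pausing first to stay polite
    time.sleep(delay)
    soup = bs.BeautifulSoup(session.get(url).text, 'lxml')
    return soup.find_all('a')

# usage, mirroring the outer loop of the answers above
for a in get_links('http://dl5.lavinmovie.net/Series/'):
    if a.text != '../':
        print(a.text)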
I have solved the problem myself as well. Thanks to everyone who helped.

How to fix "cannot serialize '_io.BufferedReader' object" while multiprocessing?

I get the following error when trying to parse a large number of web pages from a website: Reason: 'TypeError("cannot serialize '_io.BufferedReader' object",)'. How can I fix it?
The full error message is:
File "main.py", line 29, in <module>
records = p.map(defs.scrape,state_urls)
File "C:\Users\Utilisateur\Anaconda3\lib\multiprocessing\pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\Utilisateur\Anaconda3\lib\multiprocessing\pool.py", line 644, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x0000018DD1C3D828>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object",)'
I browsed through some of the answers for similar questions here, namely this one (multiprocessing.pool.MaybeEncodingError: Error sending result: Reason: 'TypeError("cannot serialize '_io.BufferedReader' object",)') but I don't think I'm running into the same issue, as I don't handle files directly in the scrape function.
I tried modifying the scrape function so it returned a string and not a list (I don't know why I did that), but it didn't work.
From the main.py file:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from multiprocessing import Pool
import codecs
import defs

if __name__ == '__main__':
    filename = "some_courts_test.csv"
    # not the actual values
    courts = ["blabla", "blablabla", "blablabla", "blabla"]
    client = defs.init_client()
    i = 1
    # scrapes the data from the website and puts it into a csv file
    for court in courts:
        records = []
        records_string = ""
        print("creating a file for the court of : " + court)
        f = defs.init_court_file(court)
        print("generating urls for the court of " + court)
        state_urls = defs.generate_state_urls(court)
        for url in state_urls:
            print(url)
        print("scraping creditors from : " + court)
        p = Pool(10)
        records = p.map(defs.scrape, state_urls)
        records_string = ''.join(records[1])
        p.terminate()
        p.join()
        for r in records_string:
            f.write(r)
        records = []
        f.close()
From the defs file:
def scrape(url):
    data = []
    row_string = ' '
    final_data = []
    final_string = ' '
    uClient = uReq(url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    table = page_soup.find("table", {"class": "table table-striped"})
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.replace(',', ' ') for ele in cols]  # cleans it up
        for ele in cols:
            if ele:
                data.append(ele)
                data.append(',')
        data.append('\n')
    return(data)
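One thing worth checking, offered as a hedged suggestion rather than a confirmed fix: if scrape raises (for example an HTTPError from urlopen), Pool.map has to pickle that exception to send it back to the parent process, and HTTPError carries the open response object, whose _io.BufferedReader cannot be serialized. Catching errors inside the worker and returning only plain, picklable data avoids that path. A minimal sketch reusing the names from the code above:

from urllib.request import urlopen as uReq
from urllib.error import URLError

def scrape(url):
    data = []
    try:
        uClient = uReq(url)
        page_html = uClient.read()
        uClient.close()
    except URLError as e:
        # return only plain strings, never the exception object itself
        return ['error', url, str(e), '\n']
    # ... parse page_html into data with BeautifulSoup, as in the original scrape() ...
    return data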

Python requests.get returning ValueError when looping over URLs

I'm writing a script to do the following:
Ingest a csv file
Loop through values in a url column
Return status codes for each url field
My data is coming from a csv file that I've written. The url field contains a string with 1 or 2 urls to check.
The CSV file is structured as follows:
id,site_id,url_check,js_pixel_json
12187,333304,"[""http://www.google.com"", ""http://www.facebook.com""]",[]
12187,333304,"[""http://www.google.com""]",[]
I have a function that loops through every column correctly; however, when I attempt to pull the status code, I get the following error:
Traceback (most recent call last):
File "help.py", line 29, in <module>
loopUrl(inputReader)
File "help.py", line 26, in loopUrl
urlStatus = requests.get(url)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/sessions.py", line 498, in request
prep = self.prepare_request(req)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/sessions.py", line 441, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/models.py", line 309, in prepare
self.prepare_url(url, params)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/models.py", line 375, in prepare_url
scheme, auth, host, port, path, query, fragment = parse_url(url)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/util/url.py", line 185, in parse_url
host, url = url.split(']', 1)
ValueError: not enough values to unpack (expected 2, got 1)
Here is my code:
import requests
import csv

input = open('stackoverflow_help.csv')
inputReader = csv.reader(input)

def loopUrl(inputReader):
    pixelCheck = []
    for row in inputReader:
        checkUrl = row[2]
        if inputReader.line_num == 1:
            continue  # skip first row
        elif checkUrl == '[]':
            continue
        elif checkUrl == 'NULL':
            continue
        urlList = str(checkUrl)
        for url in urlList:
            urlStatus = requests.get(url)
            print(urlStatus.response_code)

loopUrl(inputReader)
The traceback points into the requests module, but I believe something in my loop is causing the error.
["http://www.google.com", "http://www.facebook.com"] is a string, not a list. You are iterating it character by character, thus giving you the error above.
You need to safely evaluate the string to get an actual list of URLs instead of a single string.
Example:
>>> import ast
>>> x = u'[ "A","B","C" , " D"]'
>>> x = ast.literal_eval(x)
>>> x
['A', 'B', 'C', ' D']
>>> x = [n.strip() for n in x]
>>> x
['A', 'B', 'C', 'D']
Reference: Convert string representation of list to list
In your code it would be:
urlList = ast.literal_eval(checkUrl)  # not str(checkUrl)
for url in urlList:
    urlStatus = requests.get(url)
    print(urlStatus.status_code)
Need to clean this up a bit, but should get you going:
import requests
import csv
import ast

input = open('stackoverflow_help.csv')
inputReader = csv.reader(input)

def loopUrl(inputReader):
    pixelCheck = []
    for row in inputReader:
        if inputReader.line_num == 1:
            continue  # skip first row
        checkUrl = row[2]
        try:
            checkUrl = ast.literal_eval(checkUrl)
        except:
            continue
        if checkUrl == []:
            continue
        elif checkUrl == 'NULL':
            continue
        for url in checkUrl:
            urlStatus = requests.get(url)
            print(urlStatus.status_code)

loopUrl(inputReader)
Output:
200
200
200
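As a side note, the url_check field in the sample CSV is valid JSON once the csv module has unescaped the doubled quotes, so json.loads from the standard library is an alternative to ast.literal_eval here. A small sketch:

import json

checkUrl = '["http://www.google.com", "http://www.facebook.com"]'  # what csv.reader hands you for that column
urlList = json.loads(checkUrl)  # a real Python list of URL strings
for url in urlList:
    print(url)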

Python urllib.urlopen IOError

So I have the following lines of code in a function
sock = urllib.urlopen(url)
html = sock.read()
sock.close()
and they work fine when I call the function by hand. However, when I call the function in a loop (using the same urls as earlier) I get the following error:
Traceback (most recent call last):
File "./headlines.py", line 256, in <module>
main(argv[1:])
File "./headlines.py", line 37, in main
write_articles(headline, output_folder + "articles_" + term +"/")
File "./headlines.py", line 232, in write_articles
print get_blogs(headline, 5)
File "/Users/michaelnussbaum08/Documents/College/Sophmore_Year/Quarter_2/Innovation/Headlines/_code/get_content.py", line 41, in get_blogs
sock = urllib.urlopen(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 87, in urlopen
return opener.open(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 203, in open
return getattr(self, name)(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 314, in open_http
if not host: raise IOError, ('http error', 'no host given')
IOError: [Errno http error] no host given
Any ideas?
Edit: more code:
def get_blogs(term, num_results):
    search_term = term.replace(" ", "+")
    print "search_term: " + search_term
    url = 'http://blogsearch.google.com/blogsearch_feeds?hl=en&q='+search_term+'&ie=utf-8&num=10&output=rss'
    print "url: " + url
    #error occurs on line below
    sock = urllib.urlopen(url)
    html = sock.read()
    sock.close()

def write_articles(headline, output_folder, num_articles=5):
    #calls get_blogs
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    output_file = output_folder+headline.strip("\n")+".txt"
    f = open(output_file, 'a')
    articles = get_articles(headline, num_articles)
    blogs = get_blogs(headline, num_articles)

#NEW FUNCTION
#the loop that calls write_articles
for term in trend_list:
    if do_find_max == True:
        fill_search_term(term, output_folder)
    headlines = headline_process(term, output_folder, max_headlines, do_find_max)
    for headline in headlines:
        try:
            write_articles(headline, output_folder + "articles_" + term + "/")
        except UnicodeEncodeError:
            pass
I had this problem when a variable I was concatenating with the url, in your case search_term
url = 'http://blogsearch.google.com/blogsearch_feeds?hl=en&q='+search_term+'&ie=utf-8&num=10&output=rss'
had a newline character at the end. So make sure you do
search_term = search_term.strip()
You might also want to do
search_term = urllib2.quote(search_term)
to make sure your string is safe for a url
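Expanding on that, here is a hedged sketch (Python 2, to match the question) that strips the term and lets urllib.urlencode handle the quoting instead of concatenating the query string by hand:

import urllib

search_term = term.strip()  # drop any trailing newline picked up from the input file
params = urllib.urlencode({
    'hl': 'en',
    'q': search_term,   # urlencode percent-escapes spaces, '&', and other unsafe characters
    'ie': 'utf-8',
    'num': 10,
    'output': 'rss',
})
url = 'http://blogsearch.google.com/blogsearch_feeds?' + params
sock = urllib.urlopen(url)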
Use urllib2 instead if you don't want to handle reading on a per-block basis yourself.
This probably does what you expect.
import urllib2
req = urllib2.Request(url='http://stackoverflow.com/')
f = urllib2.urlopen(req)
print f.read()
In your function's loop, right before the call to urlopen, perhaps put a print statement:
print(url)
sock = urllib.urlopen(url)
This way, when you run the script and get the IOError, you will see the url which is causing the problem. The error "no host given" can be replicated if url equals something like 'http://'...
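For what it's worth, the "no host given" case is easy to reproduce in isolation (Python 2), which can help confirm that a malformed URL is the culprit:

import urllib

urllib.urlopen('http://')  # raises IOError: ('http error', 'no host given')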
