Program runs only with None results - python

My problem is that my program only produces None results. I suspect there is a problem with the data parameter in my program.
from lxml import html
import requests

etree = html.etree

class News(object):
    def __init__(self):
        self.url = 'https://www.chinatimes.com/newspapers/260118'
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"}

    def get_data(self, url):
        response = requests.get(url, headers=self.headers)
        return response.content

    def parse_data(self, data):
        # create the element object
        data = data.decode()
        html = etree.HTML(data)
        el_list = html.xpath('/html/body/div[2]/div/div[2]/div/section/ul/li/div/div/div/h3/a/font')
        data_list = []
        for el in el_list:
            temp = {}
            temp['title'] = el
            temp['link'] = 'https://www.chinatimes.com' + el.xpath("./@href")[0]
            data_list.append(temp)
        try:
            # get the URL of the next page
            next_url = 'https://www.chinatimes.com' + html.xpath('/html/body/div[2]/div/div[2]/div/section/nav/ul/li[7]/a/@href')[0]
        except:
            next_url = None
        return data_list, next_url

    def save_data(self, data_list):
        for data in data_list:
            print(data)

    def run(self):
        # start from the first page URL
        next_url = self.url
        while True:
            data = self.get_data(next_url)
            data_list, next_url = self.parse_data(data)
            self.save_data(data_list)
            print(next_url)
            if next_url == None:
                break

if __name__ == '__main__':
    news = News()
    news.run()
I use Google Chrome, and my XPath should be correct. I think there is a problem with my data parameter, but I am not sure about it. I hope someone can help me take a look; thank you very much. (When I submitted this question, the system kept saying that most of my post was code, so I had to add some extra words to be able to submit it.)
I thought about it again, and it would be better to start from here. Maybe my XPath is correct, but I don't know how to solve the problem at this line: data = data.decode()
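If the decode step really is the suspect, a more defensive sketch of that part could decode with whatever encoding requests reports and replace undecodable bytes instead of raising; response.encoding, response.apparent_encoding and the errors='replace' flag are standard requests/str features, and this is only an illustration, not a confirmed fix:

import requests

def get_text(url, headers):
    # Fetch the page and decode it defensively: fall back to the encoding
    # requests guesses from the body, and replace undecodable bytes instead
    # of raising UnicodeDecodeError.
    response = requests.get(url, headers=headers)
    encoding = response.encoding or response.apparent_encoding or 'utf-8'
    return response.content.decode(encoding, errors='replace')

If the decode succeeds but the XPath still returns an empty list, the cause is more likely that the /font elements only exist in the DOM Chrome displays (for example after page translation) and not in the raw HTML that requests downloads.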

Related

Windows web-scraping encoding error: a non-CS undergraduate, don't know what to do

This is the first time I have used this platform, so I may not know the rules here, but here is my problem:
I signed up for a course on web scraping with Python; the teacher uses macOS but I use Windows.
He gave us this .py script to extract information from a job website, and it worked perfectly fine on his computer, but not on mine.
The error code is this:
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 3452: illegal multibyte sequence
Process finished with exit code 1
I googled this error several times and applied different lines of code, but the same problem keeps popping up. As a political science student I barely know anything about computer programming, so I suspect I may have put the correct code in the wrong place (I put them at the bottom). Here are some examples:
print(r.text['response'][i].replace(u'\xa0 ', u' '))
self.file.write(content.replace(u'\xa0', u''))
self.file.write(content.encode("gbk", 'ignore').decode("gbk", "ignore"))
self.file = open('biaobai.json', 'w', encoding="utf-8")
self.file.write(content)
The original code is as follows:
# -*- coding: utf-8 -*-
import json
import logging

import requests
from lxml import etree

class HKJob(object):
    def __init__(self, page):
        # init logging
        logging.basicConfig(filename='HKJob.log', level=logging.INFO)
        self.page = page
        self.url = 'https://hk.jobsdb.com/hk/en/Search/FindJobs?JSRV=1&page=' + str(page)
        print('Working on page ' + str(page))
        self.headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'zh-CN,zh;q=0.9',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
# 'cookie': '__cfduid=d4a5e5025c307a0899cfb595f591de2321591411627; __cfduid=d4a5e5025c307a0899cfb595f591de2321591411627; azTest=j%3A%7B%22id%22%3A%22af644b3a-7e57-4b74-a6ac-7081d2f3c78e%22%2C%22createdAt%22%3A%222020-06-06T00%3A46%3A43.809Z%22%7D; ABNEWHP=1607; showNewHomePage=B; isSmartSearch=A; sol_id=74034b65-7d37-43de-94e1-ea2caf21d6dc; s_fid=47BC437B5DDD3E34-38816260845AF8DB; _gcl_au=1.1.3255028.1591411635; s_vi=[CS]v1|2F6D81D905159C2C-60000A22221F80DA[CE]; intercom-id-o7zrfpg6=489c6e71-c371-4a82-9478-26f4afa57525; _fbp=fb.1.1591411666764.1130609656; s_cc=true; _hjid=533d3b03-b258-4c44-b361-6b35adae2f01; RecentSearch=%7B%22Keyword%22%3A%5B%22data%22%5D%7D; ABSSRPGroup=B; ABHPGroup=B; ABJDGroup=B; NSC_wjq_kpctec.dpn_ttm=30dfa3dbcdc234aa83959421623161a99f20596b3402b72e8d156af38c20a8f9e3dee830; ABIDPGroup=1; sol_id_pre_stored=74034b65-7d37-43de-94e1-ea2caf21d6dc; _gid=GA1.2.818155812.1593870501; intercom-session-o7zrfpg6=; ASP.NET_SessionId=rronrppqvou1gjrlcpebyl5g; s_sq=%5B%5BB%5D%5D; ABSSRP=1659; sol_id=74034b65-7d37-43de-94e1-ea2caf21d6dc; utag_main=v_id:017287866ed70010221858819c4c03078007207000bd0$_sn:5$_se:2$_ss:0$_st:1593915998514$ses_id:1593914193762%3Bexp-session$_pn:1%3Bexp-session; _ga=GA1.1.960239391.1591411635; _ga_88RH71GXX9=GS1.1.1593914191.5.1.1593914314.0',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
    def send_get_request(self, url):
        # 3. Receive Response
        r = requests.get(url, headers=self.headers)
        if r.text:
            response = r.text
            # print('get response success')
            return response
        else:
            print('get response fail')
            return ''

    def extract_info_urls(self, response):
        # The response is not json this time. We need to extract information from the html file.
        # Therefore, we need to import a new module -- lxml
        # Please review how to install a new module using conda, we discussed it in the first session.
        raw_tree = etree.HTML(response)
        # Here we first extract the urls of the detailed info pages
        job_urls = raw_tree.xpath(
            '//*[@id="contentContainer"]/div[2]/div/div/div[2]/div/div/div[3]/div/div/div/div/div/article/div/div/div[1]/div[1]/div/div[1]/div/div/div[2]/div[1]/div/h1/a/@href')
        return job_urls

    # 4. Extract Information
    def extract_information(self, response):
        raw_tree = etree.HTML(response)
        dict_result = {}
        dict_result['job_name'] = raw_tree.xpath(
            '//*[@id="contentContainer"]/div/div[1]/div[2]/div[1]/div/div/div[1]/div/div/div[2]/div/div/div/div[1]/h1/text()')[0]
        dict_result['company_name'] = raw_tree.xpath(
            '//*[@id="contentContainer"]/div/div[1]/div[2]/div[1]/div/div/div[1]/div/div/div[2]/div/div/div/div[2]/span/text()')[0]
        try:
            dict_result['company_img'] = raw_tree.xpath(
                '//*[@id="contentContainer"]/div/div[1]/div[2]/div[1]/div/div/div[1]/div/div/div[1]/div/img/@src')[0]
        except IndexError:
            dict_result['company_img'] = ''
        try:
            dict_result['work_place'] = raw_tree.xpath(
                '//*[@id="contentContainer"]/div/div[1]/div[2]/div[1]/div/div/div[2]/div/div/div/div[1]/div/a/span/text()')[0]
        except IndexError:
            dict_result['work_place'] = ''
        try:
            dict_result['salary'] = raw_tree.xpath(
                '//*[@id="contentContainer"]/div/div[1]/div[2]/div[1]/div/div/div[2]/div/div/div/div[2]/span/text()')[0]
        except IndexError:
            dict_result['salary'] = ''
        dict_result['posted_time'] = raw_tree.xpath(
            '//*[@id="contentContainer"]/div/div[1]/div[2]/div[1]/div/div/div[2]/div/div/div/div[3]/span/text()')[0]
        dict_result['job_details'] = '\n'.join(
            raw_tree.xpath('//*[@id="contentContainer"]/div/div[2]/div/descendant::*/text()'))
        dict_result['page'] = self.page
        return dict_result

    def save_information(self, raw_json):
        with open('HKJob_result.json', 'a+') as out_f:
            out_f.write(json.dumps(raw_json, ensure_ascii=False) + '\n')

    def run(self):
        response = self.send_get_request(self.url)
        job_urls = self.extract_info_urls(response)
        for url in job_urls:
            try:
                print('Scraping url ' + url)
                info_response = self.send_get_request(url)
                raw_json = self.extract_information(info_response)
                raw_json['job_url'] = url
                # self.save_information_json(raw_json)
                self.save_information(raw_json)
            except IndexError as e:
                print('There is something wrong when parsing ' + url)
                logging.info(str(e) + ' ' + url)
Which Python version is the teacher using, and which Python version are you using?
Try using a Python version newer than 3.8.
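For reference, this 'gbk' UnicodeEncodeError usually appears on Windows because the output file in save_information() is opened without an encoding, so Python falls back to the system default (gbk on a Chinese Windows install), which cannot represent characters such as \xa0. Below is a minimal sketch of the two workarounds already hinted at in the question, replacing the character or opening the file with an explicit UTF-8 encoding, using a placeholder content string:

# Hypothetical content string; in the real script this would be the line
# produced by json.dumps(raw_json, ensure_ascii=False).
content = "some scraped text\xa0with a non-breaking space"

# Option 1: strip the characters gbk cannot encode before writing.
cleaned = content.replace("\xa0", " ")

# Option 2: open the output file with an explicit UTF-8 encoding so the
# Windows default codec is never used for this file.
with open("HKJob_result.json", "a+", encoding="utf-8") as out_f:
    out_f.write(cleaned + "\n")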

Can't force a script to try a few times when it fails to grab a title from a webpage

I've created a script to get the title of different shops from some identical webpages. The script is doing fine.
I'm now trying to create logic within the script to let it try a few times if it somehow fails to grab the titles from those pages.
As a test, if I define the line with a broken selector, as in name = soup.select_one(".sales-info > h").text, the script keeps looping indefinitely.
I've tried so far with:
import requests
from bs4 import BeautifulSoup

links = (
    'https://www.yellowpages.com/san-francisco-ca/mip/nizarios-pizza-481135933',
    'https://www.yellowpages.com/nationwide/mip/credo-452182701'
)

def get_title(s, link):
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    try:
        name = soup.select_one(".sales-info > h1").text
    except Exception:
        print("trying again")
        return get_title(s, link)  # I wish to make a change here so the script tries a few times instead of retrying indefinitely
    return name

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
        for link in links:
            print(get_title(s, link))
How can I let the script try a few times when it fails to grab a title from a webpage?
PS: The webpages that I've used within the script are placeholders.
I added some parameters to specify the number of retries, the sleep between retries, and a default value to return if every retry fails:
import time
import requests
from bs4 import BeautifulSoup

links = (
    'https://www.webscraper.io/test-sites/e-commerce/allinone',
    'https://www.webscraper.io/test-sites/e-commerce/static'
)

def get_title(s, link, retries=3, sleep=1, default=''):
    """
    s -> session
    link -> url
    retries -> number of retries before returning the default value
    sleep -> sleep between tries (in seconds)
    default -> default value to return if every retry fails
    """
    name, current_retry = default, 0
    while current_retry != retries:
        r = s.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        try:
            name = soup.select_one("h8").text
            break  # got a title, stop retrying
        except Exception:
            print("Retry {}/{}".format(current_retry + 1, retries))
            time.sleep(sleep)
            current_retry += 1
    return name

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
        for link in links:
            print(get_title(s, link, 3, 1, 'Failed to grab {}'.format(link)))
Prints:
Retry 1/3
Retry 2/3
Retry 3/3
Failed to grab https://www.webscraper.io/test-sites/e-commerce/allinone
Retry 1/3
Retry 2/3
Retry 3/3
Failed to grab https://www.webscraper.io/test-sites/e-commerce/static
I think the simplest way would be to switch from recursion to a loop:
def get_title(s, link):
    failed = 0
    while failed < 5:
        try:
            r = s.get(link)
            soup = BeautifulSoup(r.text, "lxml")
            name = soup.select_one(".sales-info > h1").text
            return name
        except Exception:  # Best to specify which one, by the way
            failed += 1
    print('Failed too many times')
    return None
You can try using any retrying library, such as tenacity or backoff. Note that these libraries usually work as decorators, so your function simply needs the import and the decorator applied to it, in a fashion similar to:
import requests
from bs4 import BeautifulSoup
from tenacity import retry  # or: import backoff
...

@retry  # or: @backoff.on_exception(backoff.expo, requests.exceptions.RequestException)
def get_title(s, link, retries=3, sleep=1, default=''):
    ...
You can achieve the same thing in different ways. Here is another one you might want to consider trying:
import time
import requests
from bs4 import BeautifulSoup

links = [
    "https://www.yellowpages.com/san-francisco-ca/mip/nizarios-pizza-481135933",
    "https://www.yellowpages.com/nationwide/mip/credo-452182701"
]

def get_title(s, link, counter=0):
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    try:
        name = soup.select_one(".sales-info > h").text
    except Exception:
        if counter <= 3:
            time.sleep(1)
            print("done trying {} times".format(counter))
            counter += 1
            return get_title(s, link, counter)
        else:
            return None
    return name

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
        for link in links:
            print(get_title(s, link))

Resolve Python Module Error To Enable Web Scraping script?

Using Stack Overflow for the first time, trying to figure out how to scrape Yelp data and having a hard time. I have set up lxml, Beautiful Soup, requests, pip, and Python, and have added these to the PATH in the system variables, yet I am still getting the error below when I try to run the code. Any suggestions?
File "test2.py", line 4, in
from exceptions import ValueError
ModuleNotFoundError: No module named 'exceptions'
from lxml import html
import json
import requests
from exceptions import ValueError
import re, urllib
import urllib3
import argparse
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
from requests.packages.urllib3.exceptions import InsecureRequestWarning
import time
from concurrent.futures import ThreadPoolExecutor
import sys
from threading import Thread
import os
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
#[#'https://www.yelp.com/biz/kdb-kitchen-den-bar-long-beach',
yelp_urls =['https://www.yelp.com/biz/the-atlas-room-washington','https://www.yelp.com/biz/the-rack-brandon','https://www.yelp.com/biz/payard-p%C3%A2tisserie-and-bistro-new-york-2','https://www.yelp.com/biz/maison-giraud-pacific-palisades','https://www.yelp.com/biz/saltbox-san-diego','https://www.yelp.com/biz/carmichaels-chicago-steak-house-chicago','https://www.yelp.com/biz/black-eyed-pea-restaurant-houston-6','https://www.yelp.com/biz/perfecto-mundo-latin-fusion-bistro-commack','https://www.yelp.com/biz/smittys-bbq-boyd','https://www.yelp.com/biz/reston-kabob-reston','https://www.yelp.com/biz/bookmark-cafe-largo','https://www.yelp.com/biz/the-tin-angel-pittsburgh','https://www.yelp.com/biz/briantos-original-hoagies-orlando','https://www.yelp.com/biz/freeway-diner-woodbury','https://www.yelp.com/biz/river-gods-cambridge','https://www.yelp.com/biz/golan-kosher-restaurant-north-hollywood-2','https://www.yelp.com/biz/city-hall-restaurant-new-york-2','https://www.yelp.com/biz/empire-pizza-and-grill-west-chester','https://www.yelp.com/biz/cityzen-washington-2','https://www.yelp.com/biz/three-degrees-los-gatos','https://www.yelp.com/biz/applebees-grill-bar-quakertown','https://www.yelp.com/biz/johnny-carinos-covina','https://www.yelp.com/biz/buffet-de-la-gare-hastings-hdsn','https://www.yelp.com/biz/continental-food-management-la-mirada','https://www.yelp.com/biz/elephant-bar-restaurant-peoria','https://www.yelp.com/biz/sullivans-steakhouse-denver','https://www.yelp.com/biz/yucatan-liquid-stand-coppell','https://www.yelp.com/biz/tomato-pie-morristown','https://www.yelp.com/biz/willett-house-port-chester','https://www.yelp.com/biz/thai-corner-san-antonio-2','https://www.yelp.com/biz/silkes-american-grill-mesa','https://www.yelp.com/biz/t-mex-cantina-fort-lauderdale-2','https://www.yelp.com/biz/casa-oaxaca-washington','https://www.yelp.com/biz/wings-on-wheels-hebron','https://www.yelp.com/biz/siris-thai-french-cuisine-cherry-hill','https://www.yelp.com/biz/nightwood-chicago','https://www.yelp.com/biz/cafe-gallery-burlington','https://www.yelp.com/biz/the-hurricane-caf%C3%A9-seattle-2','https://www.yelp.com/biz/231-ellsworth-san-mateo','https://www.yelp.com/biz/la-marmite-williston-park','https://www.yelp.com/biz/the-river-house-palm-beach-gardens-2','https://www.yelp.com/biz/langermanns-baltimore','https://www.yelp.com/biz/del-friscos-grille-phoenix','https://www.yelp.com/biz/carrows-family-restaurant-antioch','https://www.yelp.com/biz/minerva-fine-indian-herndon-va-herndon-5','https://www.yelp.com/biz/the-mason-bar-dallas','https://www.yelp.com/biz/la-cote-cafe-and-wine-bar-seattle','https://www.yelp.com/biz/vareli-new-york','https://www.yelp.com/biz/wendys-wixom','https://www.yelp.com/biz/lanterna-tuscan-bistro-nyack','https://www.yelp.com/biz/yo-taco-duxbury','https://www.yelp.com/biz/bombay-palace-new-york','https://www.yelp.com/biz/cafe-buonaros-naperville','https://www.yelp.com/biz/ponti-seafood-grill-seattle-3','https://www.yelp.com/biz/bill-johnsons-big-apple-restaurants-phoenix-5','https://www.yelp.com/biz/by-word-of-mouth-oakland-park','https://www.yelp.com/biz/anna-maries-pizza-and-restaurant-wharton','https://www.yelp.com/biz/dierdorf-and-harts-steakhouse-saint-louis','https://www.yelp.com/biz/wine-5-cafe-las-vegas','https://www.yelp.com/biz/ernies-restaurant-plymouth','https://www.yelp.com/biz/next-door-pizza-and-pub-lees-summit','https://www.yelp.com/biz/lannys-alta-cocina-mexicana-fort-worth','https://www.yelp.com/biz/jalisco-mexican-restaurant-eastlake','https://www.yelp.com/biz/cl
io-boston','https://www.yelp.com/biz/uncommon-grounds-aliquippa','https://www.yelp.com/biz/uozumi-restaurant-palmdale','https://www.yelp.com/biz/enzos-pizza-matawan','https://www.yelp.com/biz/the-pointe-cafe-south-san-francisco','https://www.yelp.com/biz/captains-restaurant-and-seafood-market-florida-city','https://www.yelp.com/biz/le-perigord-new-york-4','https://www.yelp.com/biz/i-love-thai-arlington','https://www.yelp.com/biz/bistro-44-bedford','https://www.yelp.com/biz/ritters-marietta','https://www.yelp.com/biz/rouge-et-blanc-new-york','https://www.yelp.com/biz/assembly-steak-house-and-seafood-grill-englewood-cliffs-2','https://www.yelp.com/biz/american-turkish-restaurant-fort-lauderdale','https://www.yelp.com/biz/r-and-r-bar-b-que-and-catering-service-missouri-2','https://www.yelp.com/biz/sushi-land-long-beach','https://www.yelp.com/biz/longshots-sports-bar-waretown','https://www.yelp.com/biz/salt-creek-barbeque-glendale-heights','https://www.yelp.com/biz/pizza-market-breese','https://www.yelp.com/biz/john-qs-steakhouse-cleveland','https://www.yelp.com/biz/bistro-n-boca-raton-2','https://www.yelp.com/biz/samanthas-restaurant-silver-spring-2','https://www.yelp.com/biz/baha-brothers-sandbar-grill-taunton-3','https://www.yelp.com/biz/cafe-cortina-farmington-hills-5','https://www.yelp.com/biz/big-beaver-tavern-troy','https://www.yelp.com/biz/hogans-restaurant-bloomfield-hills','https://www.yelp.com/biz/the-copper-monkey-beaverton','https://www.yelp.com/biz/clement-street-bar-and-grill-san-francisco','https://www.yelp.com/biz/pepin-scottsdale','https://www.yelp.com/biz/village-belle-philadelphia','https://www.yelp.com/biz/sweet-woodruff-san-francisco','https://www.yelp.com/biz/siam-marina-tinley-park','https://www.yelp.com/biz/luigis-italian-restaurant-centennial-2','https://www.yelp.com/biz/smokin-wills-barbecue-roselle','https://www.yelp.com/biz/voltaire-restaurant-scottsdale','https://www.yelp.com/biz/jus-cookins-restaurant-lakewood-2','https://www.yelp.com/biz/pegs-countryside-cafe-hamel','https://www.yelp.com/biz/rays-grill-fulshear','https://www.yelp.com/biz/cafe-zalute-rosemont','https://www.yelp.com/biz/guard-house-inn-gladwyne','https://www.yelp.com/biz/road-runner-grand-canyon-las-vegas-2','https://www.yelp.com/biz/garage-restaurant-and-cafe-new-york','https://www.yelp.com/biz/los-tapatios-cedar-hill','https://www.yelp.com/biz/chengdu-46-clifton','https://www.yelp.com/biz/moby-dick-house-of-kabob-fairfax','https://www.yelp.com/biz/natures-food-patch-clearwater','https://www.yelp.com/biz/taco-del-mar-hillsboro-3','https://www.yelp.com/biz/ms-tootsies-rbl-philadelphia','https://www.yelp.com/biz/the-big-c-athletic-club-concord','https://www.yelp.com/biz/west-hanover-pizzeria-hanover','https://www.yelp.com/biz/georges-pastaria-houston','https://www.yelp.com/biz/encuentro-oakland-3','https://www.yelp.com/biz/smokys-bbq-eldersburg','https://www.yelp.com/biz/ruby-tuesday-san-antonio','https://www.yelp.com/biz/saladworks-philadelphia-4','https://www.yelp.com/biz/captain-pizza-middleton','https://www.yelp.com/biz/bob-evans-fredericksburg-3','https://www.yelp.com/biz/frittata-clawson','https://www.yelp.com/biz/the-sandwich-spot-palm-springs','https://www.yelp.com/biz/freds-mexican-cafe-san-diego-4','https://www.yelp.com/biz/geordies-steak-phoenix-2','https://www.yelp.com/biz/five-guys-wayne-5','https://www.yelp.com/biz/zen-sushi-la-crescenta-2','https://www.yelp.com/biz/the-summit-steakhouse-aurora-2','https://www.yelp.com/biz/miramar-bistro-highwood','https://www.yelp.com/biz/mick-o-sheas
-baltimore','https://www.yelp.com/biz/dennys-houston-30','https://www.yelp.com/biz/carls-jr-henderson-5','https://www.yelp.com/biz/mexican-town-restaurant-detroit','https://www.yelp.com/biz/sushi-roku-las-vegas','https://www.yelp.com/biz/giant-pizza-king-san-diego','https://www.yelp.com/biz/quiznos-brooklyn-6','https://www.yelp.com/biz/taco-bell-glen-ellyn','https://www.yelp.com/biz/las-tortas-locas-marietta','https://www.yelp.com/biz/smith-and-wollensky-las-vegas-2','https://www.yelp.com/biz/happy-garden-chinese-brighton','https://www.yelp.com/biz/urban-foodie-feed-store-college-park','https://www.yelp.com/biz/the-wolf-oakland','https://www.yelp.com/biz/scuzzis-italian-restaurant-san-antonio-4','https://www.yelp.com/biz/better-gourmet-health-kitchen-staten-island','https://www.yelp.com/biz/the-restaurant-and-cafe-warren','https://www.yelp.com/biz/mcdonalds-houston-214','https://www.yelp.com/biz/pyeong-chang-tofu-house-oakland','https://www.yelp.com/biz/maria-rosa-pizzeria-and-family-restaurant-flemington','https://www.yelp.com/biz/legends-sports-bar-and-grill-roseville-2','https://www.yelp.com/biz/villa-reale-pizzeria-and-restaurant-pittsburgh','https://www.yelp.com/biz/the-terrace-cafe-venice','https://www.yelp.com/biz/the-oval-room-washington-2','https://www.yelp.com/biz/high-point-coal-center','https://www.yelp.com/biz/j-and-s-montebello','https://www.yelp.com/biz/cheers-restaurant-and-bar-fort-lauderdale']
def parse_page(url):
    # url = "https://www.yelp.com/biz/frances-san-francisco"
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    response = requests.get(url, headers=headers, verify=False).text
    parser = html.fromstring(response)
    raw_name = parser.xpath("//h1[contains(@class,'page-title')]//text()")
    raw_claimed = parser.xpath("//span[contains(@class,'claim-status_icon--claimed')]/parent::div/text()")
    raw_reviews = parser.xpath("//div[contains(@class,'biz-main-info')]//span[contains(@class,'review-count rating-qualifier')]//text()")
    raw_category = parser.xpath('//div[contains(@class,"biz-page-header")]//span[@class="category-str-list"]//a/text()')
    hours_table = parser.xpath("//table[contains(@class,'hours-table')]//tr")
    details_table = parser.xpath("//div[@class='short-def-list']//dl")
    raw_map_link = parser.xpath("//a[@class='biz-map-directions']/img/@src")
    raw_phone = parser.xpath(".//span[@class='biz-phone']//text()")
    raw_address = parser.xpath('//div[@class="mapbox-text"]//div[contains(@class,"map-box-address")]//text()')
    raw_wbsite_link = parser.xpath("//span[contains(@class,'biz-website')]/a/@href")
    raw_price_range = parser.xpath("//dd[contains(@class,'price-description')]//text()")
    raw_health_rating = parser.xpath("//dd[contains(@class,'health-score-description')]//text()")
    rating_histogram = parser.xpath("//table[contains(@class,'histogram')]//tr[contains(@class,'histogram_row')]")
    raw_ratings = parser.xpath("//div[contains(@class,'biz-page-header')]//div[contains(@class,'rating')]/@title")
    working_hours = []
    for hours in hours_table:
        raw_day = hours.xpath(".//th//text()")
        raw_timing = hours.xpath("./td//text()")
        day = ''.join(raw_day).strip()
        timing = ''.join(raw_timing).strip()
        working_hours.append({day: timing})
    info = []
    for details in details_table:
        raw_description_key = details.xpath('.//dt//text()')
        raw_description_value = details.xpath('.//dd//text()')
        description_key = ''.join(raw_description_key).strip()
        description_value = ''.join(raw_description_value).strip()
        info.append({description_key: description_value})
    ratings_histogram = []
    for ratings in rating_histogram:
        raw_rating_key = ratings.xpath(".//th//text()")
        raw_rating_value = ratings.xpath(".//td[@class='histogram_count']//text()")
        rating_key = ''.join(raw_rating_key).strip()
        rating_value = ''.join(raw_rating_value).strip()
        ratings_histogram.append({rating_key: rating_value})
    name = ''.join(raw_name).strip()
    phone = ''.join(raw_phone).strip()
    address = ' '.join(' '.join(raw_address).split())
    health_rating = ''.join(raw_health_rating).strip()
    price_range = ''.join(raw_price_range).strip()
    claimed_status = ''.join(raw_claimed).strip()
    reviews = ''.join(raw_reviews).strip()
    category = ','.join(raw_category)
    cleaned_ratings = ''.join(raw_ratings).strip()
    if raw_wbsite_link:
        decoded_raw_website_link = urllib.unquote(raw_wbsite_link[0])
        website = re.findall("biz_redir\?url=(.*)&website_link", decoded_raw_website_link)[0]
    else:
        website = ''
    if raw_map_link:
        decoded_map_url = urllib.unquote(raw_map_link[0])
        map_coordinates = re.findall("center=([+-]?\d+.\d+,[+-]?\d+\.\d+)", decoded_map_url)[0].split(',')
        latitude = map_coordinates[0]
        longitude = map_coordinates[1]
    else:
        latitude = ''
        longitude = ''
    if raw_ratings:
        ratings = re.findall("\d+[.,]?\d+", cleaned_ratings)[0]
    else:
        ratings = 0
    data = {'working_hours': working_hours,
            'info': info,
            'ratings_histogram': ratings_histogram,
            'name': name,
            'phone': phone,
            'ratings': ratings,
            'address': address,
            'health_rating': health_rating,
            'price_range': price_range,
            'claimed_status': claimed_status,
            'reviews': reviews,
            'category': category,
            'website': website,
            'latitude': latitude,
            'longitude': longitude,
            'url': url,
            }
    return data
def parse_reviews(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0 Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0.'}
    response = requests.get(url, headers=headers, verify=False).text
    parser = html.fromstring(response)
    ratings_zipped = []
    reviews = [x for x in parser.xpath("//div[contains(@class,'main-section')]//div[contains(@class,'review-list')]//div[contains(@class,'review')]//div[contains(@class,'review-content')]")]
    for r in reviews:
        date = r.xpath("./div[contains(@class,'biz-rating')]//span[contains(@class,'rating-qualifier')]/text()")[0].strip()
        rating = r.xpath("./div[contains(@class,'biz-rating')]//div[contains(@class,'rating-large')]/@title")[0]
        content = r.xpath("./p")[0].text_content()
        ratings_zipped.append([date, rating, content])
    print(len(ratings_zipped))
    return ratings_zipped

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

def parse_pagination(url):
    print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    response = requests.get(url, headers=headers, verify=False)
    print(response)
    parser = html.fromstring(response.text)
    try:
        results = (int(parser.xpath("//div[contains(@class,'page-of-pages')]//text()")[0].strip().split(' ').pop())) * 20
    except IndexError:
        results = 20
    print(results)
    return results
def get_businesses_data(data):
    businesses, failed_searches = [], []
    start_time = time.time()
    result = {}
    for i, url in enumerate(data):
        print('Starting iteration: ', i)
        result['url'] = url
        pagination = parse_pagination(url)
        print('Pagination: ', pagination)
        info = parse_page(url)
        result['info'] = info
        _reviews = []
        for v in xrange(0, pagination, 20):
            paginated_url = result['url'].split('?')[0] + '?start=' + str(v)
            print('Scraping Reviews: ', paginated_url)
            _reviews += parse_reviews(paginated_url)
            time.sleep(.5)
        result['scraped_reviews'] = _reviews
        result['scraped_reviews_count'] = len(_reviews)
        businesses.append(result)
        print('Success iteration: ', i)
        # print ('Results: ', result)
        print('Num of reviews: ', str(len(_reviews)))
        print('')
        print('Time Elapsed: ', str(time.time() - start_time))
    return businesses

if __name__ == "__main__":
    index = 5
    # 0
    size = 20
    i = index * 20
    chunk = yelp_urls[i:i + size]
    businesses = get_businesses_data(chunk)
    with open('results/run_3/output_{}.json'.format(i), 'w') as f:
        json.dump(businesses, f)
from exceptions import ValueError
You don't need to do that at all; ValueError is one of the built-in exceptions (the exceptions module only existed in Python 2), not to mention the fact that you never use it anywhere in your code.
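As a quick illustration (not part of the original answer), ValueError can be raised and caught without any import at all:

# No import needed: ValueError is always available as a built-in exception.
try:
    int("not a number")
except ValueError as e:
    print("caught:", e)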

Crawler script runs without error, but there's no Excel output as I expected

I tried to crawl some housing information from a Chinese housing website. The code raises no errors when I run it. However, there is no output file when the run completes.
import requests
from bs4 import BeautifulSoup
import sys
import os
import time
import pandas as pd
import numpy as np
from parsel import Selector
import re
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.7 Safari/537.36'
}
def catchHouseList(url):
    resp = requests.get(url, headers=headers, stream=True)
    if resp.status_code == 200:
        reg = re.compile('<li.*?class="clear">.*?<a.*?class="img.*?".*?href="(.*?)"')
        urls = re.findall(reg, resp.text)
        return urls
    return []

def catchHouseDetail(url):
    resp = requests.get(url, headers=headers)
    print(url)
    if resp.status_code == 200:
        info = {}
        soup = BeautifulSoup(resp.text, 'html.parser')
        info['Title'] = soup.select('.main')[0].text
        info['Total_Price'] = soup.select('.total')[0].text
        info['Unit_Price'] = soup.select('.unit')[0].text
        info['Price_per_square'] = soup.select('.unitPriceValue')[0].text
        # p = soup.select('.tax')
        # info['Reference_price'] = soup.select('.tax')[0].text
        info['Built_time'] = soup.select('.subInfo')[2].text
        info['Place_Name'] = soup.select('.info')[0].text
        info['Area'] = soup.select('.info a')[0].text + ':' + soup.select('.info a')[1].text
        info['Lianjia_number'] = str(url)[34:].rsplit('.html')[0]
        info['flooring_plan'] = str(soup.select('.content')[2].select('.label')[0].next_sibling)
        info['floor'] = soup.select('.content')[2].select('.label')[1].next_sibling
        info['Area_Size'] = soup.select('.content')[2].select('.label')[2].next_sibling
        info['Flooring_structure'] = soup.select('.content')[2].select('.label')[3].next_sibling
        info['Inner_Area'] = soup.select('.content')[2].select('.label')[4].next_sibling
        info['Building_Category'] = soup.select('.content')[2].select('.label')[5].next_sibling
        info['House_Direction'] = soup.select('.content')[2].select('.label')[6].next_sibling
        info['Building_Structure'] = soup.select('.content')[2].select('.label')[7].next_sibling
        info['Decoration'] = soup.select('.content')[2].select('.label')[8].next_sibling
        info['Stair_Number'] = soup.select('.content')[2].select('.label')[9].next_sibling
        info['Heating'] = soup.select('.content')[2].select('.label')[10].next_sibling
        info['Elevator'] = soup.select('.content')[2].select('.label')[11].next_sibling
        # info['Aseest_Year'] = str(soup.select('.content')[2].select('.label')[12].next_sibling)
        return info

def appendToXlsx(info):
    fileName = './second_hand_houses.xlsx'
    dfNew = pd.DataFrame([info])
    if os.path.exists(fileName):
        sheet = pd.read_excel(fileName)
        dfOld = pd.DataFrame(sheet)
        df = pd.concat([dfOld, dfNew])
        df.to_excel(fileName)
    else:
        dfNew.to_excel(fileName)

def catch():
    pages = ['https://zs.lianjia.com/ershoufang/guzhenzhen/pg{}/'.format(x) for x in range(1, 21)]
    for page in pages:
        print(page)
        houseListURLs = catchHouseList(page)
        for houseDetailUrl in houseListURLs:
            try:
                info = catchHouseDetail(houseDetailUrl)
                appendToXlsx(info)
            except:
                pass
            time.sleep(2)

if __name__ == '__main__':
    catch()
I expected an Excel output file, but there is nothing in the end. It only tells me that the process finished with exit code 0.
Here's one of your problem areas, with a little rewrite to help you see it. You were returning an empty list when that status code was anything other than 200, without any warning or explanation. The rest of your script requires a list to continue running. When you return an empty list, it exits cleanly.
Now, when you run your code, this function is going to return None when the server response isn't 200, and then a TypeError is going to be raised in your catch() function, which will require further error handling.
def catchHouseList(url):
    try:
        resp = requests.get(url, headers=headers, stream=True)
        if resp.status_code == 200:
            reg = re.compile(
                '<li.*?class="clear">.*?<a.*?class="img.*?".*?href="(.*?)"')
            urls = re.findall(reg, resp.text)
            return urls
        else:
            print('catchHouseList response code:', resp.status_code)
    except Exception as e:
        print('catchHouseList:', e)
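Building on that, here is a sketch of how catch() could guard against an empty or None return instead of swallowing every error with a bare except; it only reuses the function names from the question and is not a tested rewrite:

def catch():
    pages = ['https://zs.lianjia.com/ershoufang/guzhenzhen/pg{}/'.format(x) for x in range(1, 21)]
    for page in pages:
        houseListURLs = catchHouseList(page)
        if not houseListURLs:  # None or empty list: nothing was extracted from this page
            print('no listings found on', page)
            continue
        for houseDetailUrl in houseListURLs:
            try:
                appendToXlsx(catchHouseDetail(houseDetailUrl))
            except Exception as e:  # report the reason instead of silently passing
                print('failed on', houseDetailUrl, e)
            time.sleep(2)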

How can I download high-resolution images from Google using Python + Selenium + PhantomJS?

I want to fetch more than 100 high-resolution images from Google, using Python 2.7 + Selenium + PhantomJS.
But even when I do what the tutorials say, I only get a webpage with small images on it, and I can't find any direct link to the high-resolution pictures. How can I fix this?
My code is as below.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import json

class ImgCrawler:
    def __init__(self, searchlink=None):
        self.link = searchlink
        self.soupheader = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
        self.scrolldown = None
        self.jsdriver = None

    def getPhantomJSDriver(self):
        self.jsdriver = webdriver.PhantomJS()
        self.jsdriver.get(self.link)

    def scrollDownUsePhatomJS(self, scrolltimes=1, sleeptime=10):
        for i in range(scrolltimes):
            self.jsdriver.execute_script('window.scrollTo(0,document.body.scrollHeight);')
            time.sleep(sleeptime)

    def getSoup(self, parser=None):
        print 'a', self.jsdriver.page_source
        return BeautifulSoup(self.jsdriver.page_source, parser)

    def getActualUrl(self, soup=None, flag=None, classflag=None, jsonflaglink=None, jsonflagtype=None):
        actualurl = []
        for a in soup.find_all(flag, {"class": classflag}):
            link = json.loads(a.text)[jsonflaglink]
            filetype = json.loads(a.text)[jsonflagtype]
            detailurl = link + u'.' + filetype
            actualurl.append(detailurl)
        return actualurl

if __name__ == '__main__':
    search_url = "https://www.google.com.hk/search?safe=strict&hl=zh-CN&site=imghp&tbm=isch&source=hp&biw=&bih=&btnG=Google+%E6%90%9C%E7%B4%A2&q="
    queryword = raw_input()
    query = queryword.split()
    query = '+'.join(query)
    weblink = search_url + query
    img = ImgCrawler(weblink)
    img.getPhantomJSDriver()
    img.scrollDownUsePhatomJS(2, 5)
    soup = img.getSoup('html.parser')
    print weblink
    print soup
    actualurllist = img.getActualUrl(soup, 'div', 'rg_meta', 'ou', 'ity')
    print len(actualurllist)
I tried for a long time to use PhantomJS but ended up using Chrome, which I know is not what you asked for, but it works. I could not get it to work with PhantomJS.
First get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads; you can use a headless version of Chrome, "Chrome Canary", if you are on Windows.
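If you want a completely windowless run without installing Chrome Canary, recent Selenium versions can also start regular Chrome in headless mode; this small sketch uses the standard Options class from the selenium package and is not part of the original answer:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')             # run Chrome without a visible window
options.add_argument('--window-size=1920,1080')
driver = webdriver.Chrome(options=options)     # chromedriver must be on the PATH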
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import re
import urlparse

class ImgCrawler:
    def __init__(self, searchlink=None):
        self.link = searchlink
        self.soupheader = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
        self.scrolldown = None
        self.jsdriver = None

    def getPhantomJSDriver(self):
        self.jsdriver = webdriver.Chrome()
        self.jsdriver.get(self.link)

    def scrollDownUsePhatomJS(self, scrolltimes=1, sleeptime=10):
        for i in range(scrolltimes):
            self.jsdriver.execute_script('window.scrollTo(0,document.body.scrollHeight);')
            time.sleep(sleeptime)

    def getSoup(self, parser=None):
        print 'a', self.jsdriver.page_source
        return BeautifulSoup(self.jsdriver.page_source, parser)

    def getActualUrl(self, soup=None):
        actualurl = []
        r = re.compile(r"/imgres\?imgurl=")
        for a in soup.find_all('a', href=r):
            parsed = urlparse.urlparse(a['href'])
            url = urlparse.parse_qs(parsed.query)['imgurl']
            actualurl.append(url)
            print url
        return actualurl

if __name__ == '__main__':
    search_url = "https://www.google.com.hk/search?safe=strict&hl=zh-CN&site=imghp&tbm=isch&source=hp&biw=&bih=&btnG=Google+%E6%90%9C%E7%B4%A2&q="
    queryword = raw_input()
    query = queryword.split()
    query = '+'.join(query)
    weblink = search_url + query
    img = ImgCrawler(weblink)
    img.getPhantomJSDriver()
    img.scrollDownUsePhatomJS(2, 5)
    soup = img.getSoup('html.parser')
    print weblink
    print soup
    actualurllist = img.getActualUrl(soup)
    print len(actualurllist)
I changed getActualUrl() to get the image URL from an "a" element whose "href" attribute starts with "/imgres?imgurl=".
Outputs (when "hazard" is typed into the terminal):
[u'https://static.independent.co.uk/s3fs-public/styles/article_small/public/thumbnails/image/2016/12/26/16/eden-hazard.jpg']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b1/EdenHazardDecember_2016.jpg/200px-EdenHazardDecember_2016.jpg']
[u'http://a.espncdn.com/combiner/i/?img=/photo/2016/1227/r166293_1296x729_16-9.jpg&w=738&site=espnfc']
[u'https://platform-static-files.s3.amazonaws.com/premierleague/photos/players/250x250/p42786.png']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Eden_Hazard_-_DK-Chel15_%286%29.jpg/150px-Eden_Hazard_-_DK-Chel15_%286%29.jpg']
[u'http://a.espncdn.com/combiner/i/?img=/photo/2017/0117/r172004_1296x729_16-9.jpg&w=738&site=espnfc']
[u'http://images.performgroup.com/di/library/GOAL/98/c0/eden-hazard-chelsea_1eohde060wvft1elcskrgihxq3.jpg?t=-1500575837&w=620&h=430']
[u'http://e1.365dm.com/17/03/16-9/20/skysports-eden-hazard-chelsea_3918835.jpg?20170331154242']
[u'http://a.espncdn.com/combiner/i/?img=/photo/2017/0402/r196036_1296x729_16-9.jpg&w=738&site=espnfc']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/04/nintchdbpict000316361045.jpg?strip=all&w=670&quality=100']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2017/02/10/14/eden-hazard1.jpg']
[u'http://s.newsweek.com/sites/www.newsweek.com/files/2016/11/07/eden-hazard.jpg']
[u'http://www.newstube24.com/wp-content/uploads/2017/06/haz.jpg']
[u'http://images.performgroup.com/di/library/GOAL/17/b1/eden-hazard_68ypnelg3lfd14oxkffztftt6.png?t=-1802977526&w=620&h=430']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/e/eb/DK-Chel15_%288%29.jpg/220px-DK-Chel15_%288%29.jpg']
[u'http://images.performgroup.com/di/library/omnisport/50/3f/hazard-cropped_3y08vc3ejpua168e9mgvu4mwc.jpg?t=-930203025&w=620&h=430']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict000291621611-e1490777105213.jpg?strip=all&w=745&quality=100']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2016/01/14/14/Eden-Hazard.jpg']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/5/56/Eden_Hazard%2713-14.JPG/150px-Eden_Hazard%2713-14.JPG']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict000296311943-e1490777241155.jpg?strip=all&w=596&quality=100']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2016/01/27/11/hazard.jpg']
[u'http://www.newzimbabwe.com/FCKEditor_Images/Eden-Hazard-896286.jpg']
[u'http://images.performgroup.com/di/library/GOAL/9c/93/eden-hazard_d4lbib7wdagw1hp2e5gnyov0k.jpg?t=-1763574189&w=620&h=430']
[u'http://www.guoguiyan.com/data/out/94/69914569-hazard-wallpapers.jpg']
[u'http://static.guim.co.uk/sys-images/Football/Pix/pictures/2015/4/16/1429206099512/Eden-Hazard-009.jpg']
[u'https://metrouk2.files.wordpress.com/2017/04/pri_37621532.jpg?w=620&h=406&crop=1']
[u'http://alummata.com/wp-content/uploads/2016/04/Hazard.jpg']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Hazard_vs_Norwich_%28crop%29.jpg/150px-Hazard_vs_Norwich_%28crop%29.jpg']
[u'http://i.dailymail.co.uk/i/pix/2016/11/06/20/3A185FB800000578-3910886-image-a-46_1478462522592.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/467822629-465742.jpg']
[u'http://i.dailymail.co.uk/i/pix/2015/10/17/18/2D81CB1D00000578-0-image-a-37_1445102645249.jpg']
[u'http://images.performgroup.com/di/library/GOAL_INTERNATIONAL/27/ce/eden-hazard_1kepw6rvweted1hpfmp5xwd5cs.jpg?t=-228379025&w=620&h=430']
[u'http://img.skysports.com/16/12/768x432/skysports-chelsea-manchester-city-eden-hazard_3845204.jpg?20161203162258']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict0003089026162.jpg?strip=all&w=960&quality=100']
[u'http://www.chelseafc.com/content/cfc/en/homepage/news/boilerplate-config/latest-news/2016/05/hazard--rediscovering-our-form.img.png']
[u'http://images.performgroup.com/di/library/omnisport/b5/98/hazard-cropped_172u0n8wx4j071cvgs1n3yycvw.jpg?t=2030908123&w=620&h=430']
[u'http://images.indianexpress.com/2016/05/eden-hazard-m.jpg']
[u'http://i2.mirror.co.uk/incoming/article9755579.ece/ALTERNATES/s615/PAY-Chelsea-v-Arsenal-Premier-League.jpg']
[u'http://www.chelseafc.com/content/cfc/en/homepage/news/boilerplate-config/latest-news/2017/06/hazard-injury-update.img.png']
[u'http://futhead.cursecdn.com/static/img/fm/17/players/183277_HAZARDCAM7.png']
[u'http://images.performgroup.com/di/library/GOAL/4d/6/eden-hazard-chelsea-06032017_enh1ll3uadj01ocstyopie9e4.jpg?t=-1521106510&w=620&h=430']
[u'http://images.performgroup.com/di/library/GOAL/34/8/eden-hazard-chelsea-southampton_1oca1rpy37gmn1ldmqvytti3k4.jpg?t=-1501721805&w=620&h=430']
[u'http://media.gettyimages.com/photos/eden-hazard-of-chelsea-celebrates-scoring-his-sides-third-goal-during-picture-id617452212?s=612x612']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/secondary/Eden-Hazard-889782.jpg']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2015/10/19/16/Hazard.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict000307050894.jpg?strip=all&w=660&quality=100']
[u'http://e1.365dm.com/16/11/16-9/20/skysports-eden-hazard-chelsea-football_3833122.jpg?20161116153005']
[u'http://thumb.besoccer.com/media/img_news/morata-y-hazard--besoccer.jpg']
[u'https://static.independent.co.uk/s3fs-public/styles/story_medium/public/thumbnails/image/2017/03/13/21/10-hazard.jpg']
[u'https://static.independent.co.uk/s3fs-public/styles/story_medium/public/thumbnails/image/2016/12/27/13/hazard.jpg']
[u'http://images.performgroup.com/di/library/GOAL/63/2a/eden-hazard-chelsea_15ggj1j39rmky1c5a20oxt3tly.jpg?t=1297762370']
[u'http://i1.mirror.co.uk/incoming/article9755531.ece/ALTERNATES/s615b/Chelsea-v-Arsenal-Premier-League.jpg']
[u'http://cf.c.ooyala.com/l2YmpyYjE6yvLSxGiEebNMr3N1ANS1Xc/O0cEsGv5RdudyPNn4xMDoxOjBnO_4SLA']
[u'http://media.gettyimages.com/photos/eden-hazard-of-chelsea-celebrates-scoring-his-sides-third-goal-during-picture-id617452006?s=612x612']
[u'http://a.espncdn.com/combiner/i/?img=/photo/2017/0413/r199412_2_1296x729_16-9.jpg&w=738&site=espnfc']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/04/nintchdbpict000318703991-e1493109803795.jpg?strip=all&w=960&quality=100']
[u'https://static.independent.co.uk/s3fs-public/styles/story_medium/public/thumbnails/image/2016/11/18/14/hazard-award.jpg']
[u'http://static.goal.com/2477200/2477282_heroa.jpg']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2016/12/16/14/eden-hazard.jpg']
[u'http://www.guoguiyan.com/data/out/94/69979129-hazard-wallpapers.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2016/11/nintchdbpict0002769424741.jpg?w=960&strip=all']
[u'http://v.uecdn.es/p/110/thumbnail/entry_id/0_ofavjqr8/width/660/cache_st/20170327164629/type/2/bgcolor/000000/0_ofavjqr8.jpg']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2017/02/07/10/edenhazard.jpg']
[u'http://theworldgame.sbs.com.au/sites/sbs.com.au.theworldgame/files/styles/full/public/images/e/d/eden-hazard-cropped_g6m28ldoc0b41p3f5sp2vlflt.jpg?itok=XW5M7QEA']
[u'http://e0.365dm.com/17/03/16-9/20/skysports-eden-hazard-chelsea_3909051.jpg?20170314181126']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/secondary/Chelsea-Hazard-goals-886084.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/04/nintchdbpict000319343894.jpg?strip=all&w=960&quality=100']
[u'https://www.footyrenders.com/render/Eden-Hazard-PL.png']
[u'http://media.gettyimages.com/photos/eden-hazard-of-chelsea-celebrates-as-he-scores-their-first-goal-the-picture-id672946758?s=612x612']
[u'https://pbs.twimg.com/profile_images/791664465729163264/XbCVl6BF.jpg']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/Eden_Hazard_2011.jpg/170px-Eden_Hazard_2011.jpg']
[u'http://s.newsweek.com/sites/www.newsweek.com/files/2016/02/01/guus-hiddink-says-eden-hazard-could-leave-chelsea..jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2016/06/chelsea_hazard_mobile_top.jpg?strip=all&w=750&h=352&crop=1']
[u'http://cdn.images.dailystar.co.uk/dynamic/58/photos/735000/Eden-Hazard-887735.jpg']
[u'http://i.telegraph.co.uk/multimedia/archive/03580/Hazard_Real_copy_3580583b.jpg']
[u'https://premierleague-static-files.s3.amazonaws.com/premierleague/photo/2017/05/21/47a1f452-43e4-4215-a5b8-5043c1e12a07/686302908.jpg']
[u'http://www.chelseafc.com/content/cfc/en/homepage/news/boilerplate-config/latest-news/2017/06/hazard-s-highlights.img.png']
[u'http://i.dailymail.co.uk/i/pix/2016/12/14/15/3B45B4D300000578-4032306-Hazard_PFA_Player_of_the_Year_in_2015_has_rediscovered_his_form_-a-6_1481728291902.jpg']
[u'https://img.rasset.ie/000d5137-800.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/Eden-Hazard-665260.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/Eden-Hazard-659164.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/secondary/Hazard-scored-Chelsea-s-third-goal-against-Tottenham-909804.jpg']
[u'http://cdn.images.dailystar.co.uk/dynamic/58/photos/237000/Eden-Hazard-887237.jpg']
[u'http://a.espncdn.com/combiner/i/?img=/media/motion/ESPNi/2017/0405/int_170405_Hazard_the_successor_to_Ronaldo_at_Real/int_170405_Hazard_the_successor_to_Ronaldo_at_Real.jpg&w=738&site=espnfc']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/Eden-Hazard-721522.jpg']
[u'http://media.gettyimages.com/photos/eden-hazard-of-chelsea-celebrates-with-teammates-after-scoring-his-picture-id633776492?s=612x612']
[u'http://betinmalta.com/wp-content/uploads/2017/05/hazard.jpg']
[u'http://cdn.images.dailystar.co.uk/dynamic/58/photos/708000/Eden-Hazard-Chelsea-712708.jpg']
[u'http://images.performgroup.com/di/library/omnisport/c9/4a/eden-hazard-cropped_12u5irb6bkze1cue2wpjzxa44.jpg?t=-2084914038&w=620&h=430']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/06/nintchdbpict0003291256741.jpg?strip=all&w=714&quality=100']
[u'https://premierleague-static-files.s3.amazonaws.com/premierleague/photo/2017/03/10/f97d36aa-1eef-4a78-996f-63d543c79efc/700017169TS004_Eden_Hazard_.JPG']
[u'https://s-media-cache-ak0.pinimg.com/736x/f0/01/17/f001178defb2b3be3cffb5e9b792748b--eden-hazard-liverpool-england.jpg']
[u'http://i4.mirror.co.uk/incoming/article9898829.ece/ALTERNATES/s615b/hazard-2.jpg']
[u'http://images.performgroup.com/di/library/GOAL/24/76/eden-hazard-and-lionel-messi_kamv8simc20x1p2i2fcf7lllw.png?t=421166242&w=620&h=430']
[u'https://metrouk2.files.wordpress.com/2017/03/gettyimages-618471206.jpg?w=748&h=498&crop=1']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/Hazard-Chelsea-658138.jpg']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2017/04/23/17/hazard.jpg']
[u'http://e0.365dm.com/15/10/16-9/20/eden-hazard-comparison-chelsea_3365521.jpg?20151018152317']
[u'http://cdn.images.express.co.uk/img/dynamic/galleries/x701/231048.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/galleries/x701/102742.jpg']
[u'https://i.ytimg.com/vi/GWhVkFTe_BY/maxresdefault.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/05/nintchdbpict000319343748-e1496260888520.jpg?strip=all&w=960&quality=100']
[u'https://metrouk2.files.wordpress.com/2017/06/689818982.jpg?w=748&h=498&crop=1']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict000308902470.jpg?strip=all&w=960&quality=100']
[u'https://www.thesun.co.uk/wp-content/uploads/2016/12/nintchdbpict000289569836.jpg?w=960&strip=all']
[u'https://i.ytimg.com/vi/zZ9stt70_vU/maxresdefault.jpg']
[u'https://upload.wikimedia.org/wikipedia/commons/2/2c/Kylian_Hazard_%28cropped%29.jpg']
[u'http://e00-marca.uecdn.es/assets/multimedia/imagenes/2017/03/26/14905092504845.jpg']
[u'http://images.performgroup.com/di/library/omnisport/ba/47/eden-hazard-cropped_rccnpv1me3v51kqpnj5ak4nko.jpg?t=1222186324&w=620&h=430']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/secondary/Eden-Hazard-922505.jpg']
[u'https://s-media-cache-ak0.pinimg.com/736x/48/ce/4c/48ce4c478d8b06dccacce352d9d4bdc2--eden-hazard-pogba.jpg']
[u'http://www.telegraph.co.uk/content/dam/football/2016/10/23/111897755_Editorial_use_only_No_merchandising_For_Football_images_FA_and_Premier_League_restrict-large_trans_NvBQzQNjv4BqqVzuuqpFlyLIwiB6NTmJwfSVWeZ_vEN7c6bHu2jJnT8.jpg']
[u'https://metrouk2.files.wordpress.com/2017/06/686902184.jpg?w=748&h=652&crop=1']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict0003086723271-e1489501904590.jpg?strip=all&w=960&quality=100']
[u'http://e2.365dm.com/17/01/16-9/20/skysports-chelsea-manchester-city-eden-hazard_3862340.jpg?20170406190414']
[u'http://www.telegraph.co.uk/content/dam/football/2017/05/26/TELEMMGLPICT000129483487-large_trans_NvBQzQNjv4BqajCpFXsei0OXjDFGPZkcdJOkVdu-K0ystYH4SV7DHn8.jpeg']
[u'https://i.ytimg.com/vi/FFE4Ea437ks/maxresdefault.jpg']
[u'https://i1.wp.com/www.vanguardngr.com/wp-content/uploads/2017/03/Hazard-madrid.png?resize=350%2C200']
[u'http://china.chelseafc.com/content/cfc/zh/homepage/teams/first-team/eden-hazard/summary/_jcr_content/tabparmain/box/box/textimage/image.img.jpg/1496846329140.jpg']
[u'http://static.goal.com/198700/198707_news.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/05/nintchdbpict000319357531.jpg?strip=all&w=960&quality=100']
[u'http://media.gettyimages.com/photos/eden-hazard-of-chelsea-poses-with-the-premier-league-trophy-after-the-picture-id686826640?s=612x612']
[u'http://cf.c.ooyala.com/t3dXdzYjE6VJktcnKdi7F2205I_mSSKQ/eWNh-8akTAF2kj8X4xMDoxOjBnO_4SLA']
[u'http://c.smimg.net/16/39/300x225/eden-hazard.jpg']
[u'http://www.whatfootballersearn.com/wp-content/uploads/Eden-Hazard.jpg']
[u'http://cdn.images.dailystar.co.uk/dynamic/58/photos/328000/Eden-Hazard-437328.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/secondary/Eden-Hazard-Chelsea-882846.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/Chelsea-star-Eden-Hazard-741161.jpg']
[u'https://talksport.com/sites/default/files/styles/taxonomy-img/public/field/image/201703/hazard_0.jpg']
[u'http://i.dailymail.co.uk/i/pix/2016/08/28/21/37A101A700000578-3762573-image-a-19_1472417354943.jpg']
[u'http://www.telegraph.co.uk/content/dam/football/2016/07/27/87650659-edenhazard-sport-large_trans_NvBQzQNjv4BqqVzuuqpFlyLIwiB6NTmJwfSVWeZ_vEN7c6bHu2jJnT8.jpg']
[u'http://cdn.images.dailystar.co.uk/dynamic/58/photos/165000/620x/Eden-Hazard-598896.jpg']
[u'http://i.dailymail.co.uk/i/pix/2016/05/04/21/33C8A26600000578-0-image-a-19_1462392130112.jpg']
[u'https://ichef-1.bbci.co.uk/news/660/cpsprodpb/13AA1/production/_96354508_595836d4-f21a-419b-95cb-37a65204f6eb.jpg']
[u'https://premierleague-static-files.s3.amazonaws.com/premierleague/photo/2016/11/30/1eb421ae-b210-4a01-95bb-36f509826cc1/Debruyne_v_Hazard.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict000309639487-e1490040388851.jpg?strip=all&w=739&quality=100']
[u'http://static.goal.com/3311200/3311292_heroa.jpg']
[u'http://i3.mirror.co.uk/incoming/article7986781.ece/ALTERNATES/s615b/Hazard-and-son.jpg']
[u'http://a.espncdn.com/combiner/i/?img=/photo/2016/0916/r126535_1296x729_16-9.jpg&w=738&site=espnfc']
[u'http://www.chelseafc.com/content/cfc/en/homepage/news/boilerplate-config/latest-news/2017/03/hazard-score-is-number-one-.img.png']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2015/03/17/13/eden-hazard.jpg']
[u'https://metrouk2.files.wordpress.com/2017/05/680506564.jpg?w=748&h=457&crop=1']
[u'http://media.gettyimages.com/photos/eden-hazard-of-chelsea-celebrates-with-diego-costa-of-chelsea-after-picture-id671559962?s=612x612']
[u'http://e0.365dm.com/17/05/16-9/20/skysports-eden-hazard-chelsea_3965489.jpg?20170529101357']
[u'https://s-media-cache-ak0.pinimg.com/736x/e0/80/0e/e0800e380ef363594fb292969b7c5b64--eden-hazard-chelsea-soccer.jpg']
[u'http://cdn-football365.365.co.za/content/uploads/2016/12/GettyImages.630542828.jpg']
[u'http://i.dailymail.co.uk/i/pix/2016/07/16/19/340E4A9A00000578-3693637-image-a-84_1468694248523.jpg']
[u'http://www.squawka.com/news/wp-content/uploads/2017/01/hazard-chelsea-e1485528066609.jpg']
[u'http://www.guoguiyan.com/data/out/94/68880901-hazard-wallpapers.jpg']
[u'http://www.telegraph.co.uk/content/dam/football/2017/03/12/JS122962983_EHazDavid-Rose-large_trans_NvBQzQNjv4BqtA9hvt4yaDuJhaJG2frTIUNrh1MdssoHpGF6OIxC49c.jpg']
[u'http://images.performgroup.com/di/library/GOAL/17/25/eden-hazard-chelsea_1dsnlf2z113cx10nxvp9ydudcz.jpg?t=2008335075']
[u'http://www.telegraph.co.uk/content/dam/football/2016/12/05/115182685-eden-hazard-sport-large_trans_NvBQzQNjv4BqA7a2BP2KFPtZUOepzpZgXISdNn8DgVUcalGVREaviFE.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/04/sport-preview-morata-hazard.jpg?strip=all&quality=100&w=750&h=500&crop=1']
[u'http://thumb.besoccer.com/media/img_news/eden-hazard--futbolista-del-chelsea--chelseafc.jpg']
[u'http://i4.mirror.co.uk/incoming/article7374471.ece/ALTERNATES/s615/Chelsea-Training-Session.jpg']
[u'https://metrouk2.files.wordpress.com/2017/04/671549404.jpg?w=748&h=532&crop=1']
[u'https://metrouk2.files.wordpress.com/2016/02/462363538.jpg?w=748&h=563&crop=1']
[u'https://metrouk2.files.wordpress.com/2017/05/6834221661.jpg?w=748&h=507&crop=1']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/Chelsea-star-Eden-Hazard-739447.jpg']
[u'http://cdn.quotesgram.com/img/21/41/114220036-24CA6E4700000578-2916442-Eden_Hazard_has_been_instrumental_for_Chelsea_this_season_as_the-a-7_1421682779132.jpg']
[u'http://i.dailymail.co.uk/i/pix/2016/09/29/09/38E4401500000578-3813294-image-a-1_1475138637248.jpg']
[u'http://healthyceleb.com/wp-content/uploads/2016/04/Eden-Hazard-match-between-Chelsea-Milton-Keynes-Dons-January-2016.jpg']
[u'https://talksport.com/sites/default/files/styles/taxonomy-img/public/field/image/201704/gettyimages-663029916.jpg']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/2/29/Thorgan_Hazard_2014.jpg/220px-Thorgan_Hazard_2014.jpg']
[u'http://cdn.images.dailystar.co.uk/dynamic/58/photos/91000/620x/Eden-Hazard-632850.jpg']
[u'http://i4.mirror.co.uk/incoming/article7531141.ece/ALTERNATES/s615/A-dejected-looking-Eden-Hazard.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/05/nintchdbpict000322464448-e1494602676644.jpg?strip=all&w=960&quality=100']
[u'http://images.performgroup.com/di/library/GOAL_INTERNATIONAL/76/92/chelsea-bournemouth-eden-hazard_148j4p900kzba159diso6ewwvo.jpg?t=1004329665&w=620&h=430']
[u'https://images.cdn.fourfourtwo.com/sites/fourfourtwo.com/files/styles/inline-image/public/hazard3.jpg?itok=ap0DtuZx']
[u'https://talksport.com/sites/default/files/styles/taxonomy-img/public/field/image/201707/hazard.jpg']
...
[u'http://a.espncdn.com/combiner/i/?img=/photo/2017/0330/r195127_1296x729_16-9.jpg&w=738&site=espnfc']
299
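Once the actual image URLs have been collected, saving the high-resolution files is just a matter of requesting each URL and writing the bytes to disk. A small sketch, noting that getActualUrl() appends one-element lists and that the folder name and file-naming scheme here are made up for illustration:

import os
import requests

def download_images(urls, folder='downloaded_images'):
    # Fetch every collected image URL and write the raw bytes to a local file.
    if not os.path.exists(folder):
        os.makedirs(folder)
    for i, url in enumerate(urls):
        if isinstance(url, (list, tuple)):  # getActualUrl() appends one-element lists
            url = url[0]
        try:
            r = requests.get(url, timeout=30)
            if r.status_code == 200:
                ext = os.path.splitext(url.split('?')[0])[1] or '.jpg'
                with open(os.path.join(folder, 'img_%03d%s' % (i, ext)), 'wb') as f:
                    f.write(r.content)
        except requests.RequestException as e:
            print('failed to download %s: %s' % (url, e))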
