Resolve Python Module Error To Enable Web Scraping script? - python

Using Stack Overflow for the first time, trying to figure out how to scrape Yelp data and having a hard time. I have set up lxml, Beautiful Soup, Requests, pip, and Python, and added them to the PATH in the system variables, yet I still get the error below when I try to run the code underneath it. Any suggestions?
File "test2.py", line 4, in
from exceptions import ValueError
ModuleNotFoundError: No module named 'exceptions'
from lxml import html
import json
import requests
from exceptions import ValueError
import re, urllib
import urllib3
import argparse
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
from requests.packages.urllib3.exceptions import InsecureRequestWarning
import time
from concurrent.futures import ThreadPoolExecutor
import sys
from threading import Thread
import os
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
#[#'https://www.yelp.com/biz/kdb-kitchen-den-bar-long-beach',
yelp_urls =['https://www.yelp.com/biz/the-atlas-room-washington','https://www.yelp.com/biz/the-rack-brandon','https://www.yelp.com/biz/payard-p%C3%A2tisserie-and-bistro-new-york-2','https://www.yelp.com/biz/maison-giraud-pacific-palisades','https://www.yelp.com/biz/saltbox-san-diego','https://www.yelp.com/biz/carmichaels-chicago-steak-house-chicago','https://www.yelp.com/biz/black-eyed-pea-restaurant-houston-6','https://www.yelp.com/biz/perfecto-mundo-latin-fusion-bistro-commack','https://www.yelp.com/biz/smittys-bbq-boyd','https://www.yelp.com/biz/reston-kabob-reston','https://www.yelp.com/biz/bookmark-cafe-largo','https://www.yelp.com/biz/the-tin-angel-pittsburgh','https://www.yelp.com/biz/briantos-original-hoagies-orlando','https://www.yelp.com/biz/freeway-diner-woodbury','https://www.yelp.com/biz/river-gods-cambridge','https://www.yelp.com/biz/golan-kosher-restaurant-north-hollywood-2','https://www.yelp.com/biz/city-hall-restaurant-new-york-2','https://www.yelp.com/biz/empire-pizza-and-grill-west-chester','https://www.yelp.com/biz/cityzen-washington-2','https://www.yelp.com/biz/three-degrees-los-gatos','https://www.yelp.com/biz/applebees-grill-bar-quakertown','https://www.yelp.com/biz/johnny-carinos-covina','https://www.yelp.com/biz/buffet-de-la-gare-hastings-hdsn','https://www.yelp.com/biz/continental-food-management-la-mirada','https://www.yelp.com/biz/elephant-bar-restaurant-peoria','https://www.yelp.com/biz/sullivans-steakhouse-denver','https://www.yelp.com/biz/yucatan-liquid-stand-coppell','https://www.yelp.com/biz/tomato-pie-morristown','https://www.yelp.com/biz/willett-house-port-chester','https://www.yelp.com/biz/thai-corner-san-antonio-2','https://www.yelp.com/biz/silkes-american-grill-mesa','https://www.yelp.com/biz/t-mex-cantina-fort-lauderdale-2','https://www.yelp.com/biz/casa-oaxaca-washington','https://www.yelp.com/biz/wings-on-wheels-hebron','https://www.yelp.com/biz/siris-thai-french-cuisine-cherry-hill','https://www.yelp.com/biz/nightwood-chicago','https://www.yelp.com/biz/cafe-gallery-burlington','https://www.yelp.com/biz/the-hurricane-caf%C3%A9-seattle-2','https://www.yelp.com/biz/231-ellsworth-san-mateo','https://www.yelp.com/biz/la-marmite-williston-park','https://www.yelp.com/biz/the-river-house-palm-beach-gardens-2','https://www.yelp.com/biz/langermanns-baltimore','https://www.yelp.com/biz/del-friscos-grille-phoenix','https://www.yelp.com/biz/carrows-family-restaurant-antioch','https://www.yelp.com/biz/minerva-fine-indian-herndon-va-herndon-5','https://www.yelp.com/biz/the-mason-bar-dallas','https://www.yelp.com/biz/la-cote-cafe-and-wine-bar-seattle','https://www.yelp.com/biz/vareli-new-york','https://www.yelp.com/biz/wendys-wixom','https://www.yelp.com/biz/lanterna-tuscan-bistro-nyack','https://www.yelp.com/biz/yo-taco-duxbury','https://www.yelp.com/biz/bombay-palace-new-york','https://www.yelp.com/biz/cafe-buonaros-naperville','https://www.yelp.com/biz/ponti-seafood-grill-seattle-3','https://www.yelp.com/biz/bill-johnsons-big-apple-restaurants-phoenix-5','https://www.yelp.com/biz/by-word-of-mouth-oakland-park','https://www.yelp.com/biz/anna-maries-pizza-and-restaurant-wharton','https://www.yelp.com/biz/dierdorf-and-harts-steakhouse-saint-louis','https://www.yelp.com/biz/wine-5-cafe-las-vegas','https://www.yelp.com/biz/ernies-restaurant-plymouth','https://www.yelp.com/biz/next-door-pizza-and-pub-lees-summit','https://www.yelp.com/biz/lannys-alta-cocina-mexicana-fort-worth','https://www.yelp.com/biz/jalisco-mexican-restaurant-eastlake','https://www.yelp.com/biz/clio-boston','https://www.yelp.com/biz/uncommon-grounds-aliquippa','https://www.yelp.com/biz/uozumi-restaurant-palmdale','https://www.yelp.com/biz/enzos-pizza-matawan','https://www.yelp.com/biz/the-pointe-cafe-south-san-francisco','https://www.yelp.com/biz/captains-restaurant-and-seafood-market-florida-city','https://www.yelp.com/biz/le-perigord-new-york-4','https://www.yelp.com/biz/i-love-thai-arlington','https://www.yelp.com/biz/bistro-44-bedford','https://www.yelp.com/biz/ritters-marietta','https://www.yelp.com/biz/rouge-et-blanc-new-york','https://www.yelp.com/biz/assembly-steak-house-and-seafood-grill-englewood-cliffs-2','https://www.yelp.com/biz/american-turkish-restaurant-fort-lauderdale','https://www.yelp.com/biz/r-and-r-bar-b-que-and-catering-service-missouri-2','https://www.yelp.com/biz/sushi-land-long-beach','https://www.yelp.com/biz/longshots-sports-bar-waretown','https://www.yelp.com/biz/salt-creek-barbeque-glendale-heights','https://www.yelp.com/biz/pizza-market-breese','https://www.yelp.com/biz/john-qs-steakhouse-cleveland','https://www.yelp.com/biz/bistro-n-boca-raton-2','https://www.yelp.com/biz/samanthas-restaurant-silver-spring-2','https://www.yelp.com/biz/baha-brothers-sandbar-grill-taunton-3','https://www.yelp.com/biz/cafe-cortina-farmington-hills-5','https://www.yelp.com/biz/big-beaver-tavern-troy','https://www.yelp.com/biz/hogans-restaurant-bloomfield-hills','https://www.yelp.com/biz/the-copper-monkey-beaverton','https://www.yelp.com/biz/clement-street-bar-and-grill-san-francisco','https://www.yelp.com/biz/pepin-scottsdale','https://www.yelp.com/biz/village-belle-philadelphia','https://www.yelp.com/biz/sweet-woodruff-san-francisco','https://www.yelp.com/biz/siam-marina-tinley-park','https://www.yelp.com/biz/luigis-italian-restaurant-centennial-2','https://www.yelp.com/biz/smokin-wills-barbecue-roselle','https://www.yelp.com/biz/voltaire-restaurant-scottsdale','https://www.yelp.com/biz/jus-cookins-restaurant-lakewood-2','https://www.yelp.com/biz/pegs-countryside-cafe-hamel','https://www.yelp.com/biz/rays-grill-fulshear','https://www.yelp.com/biz/cafe-zalute-rosemont','https://www.yelp.com/biz/guard-house-inn-gladwyne','https://www.yelp.com/biz/road-runner-grand-canyon-las-vegas-2','https://www.yelp.com/biz/garage-restaurant-and-cafe-new-york','https://www.yelp.com/biz/los-tapatios-cedar-hill','https://www.yelp.com/biz/chengdu-46-clifton','https://www.yelp.com/biz/moby-dick-house-of-kabob-fairfax','https://www.yelp.com/biz/natures-food-patch-clearwater','https://www.yelp.com/biz/taco-del-mar-hillsboro-3','https://www.yelp.com/biz/ms-tootsies-rbl-philadelphia','https://www.yelp.com/biz/the-big-c-athletic-club-concord','https://www.yelp.com/biz/west-hanover-pizzeria-hanover','https://www.yelp.com/biz/georges-pastaria-houston','https://www.yelp.com/biz/encuentro-oakland-3','https://www.yelp.com/biz/smokys-bbq-eldersburg','https://www.yelp.com/biz/ruby-tuesday-san-antonio','https://www.yelp.com/biz/saladworks-philadelphia-4','https://www.yelp.com/biz/captain-pizza-middleton','https://www.yelp.com/biz/bob-evans-fredericksburg-3','https://www.yelp.com/biz/frittata-clawson','https://www.yelp.com/biz/the-sandwich-spot-palm-springs','https://www.yelp.com/biz/freds-mexican-cafe-san-diego-4','https://www.yelp.com/biz/geordies-steak-phoenix-2','https://www.yelp.com/biz/five-guys-wayne-5','https://www.yelp.com/biz/zen-sushi-la-crescenta-2','https://www.yelp.com/biz/the-summit-steakhouse-aurora-2','https://www.yelp.com/biz/miramar-bistro-highwood','https://www.yelp.com/biz/mick-o-sheas-baltimore','https://www.yelp.com/biz/dennys-houston-30','https://www.yelp.com/biz/carls-jr-henderson-5','https://www.yelp.com/biz/mexican-town-restaurant-detroit','https://www.yelp.com/biz/sushi-roku-las-vegas','https://www.yelp.com/biz/giant-pizza-king-san-diego','https://www.yelp.com/biz/quiznos-brooklyn-6','https://www.yelp.com/biz/taco-bell-glen-ellyn','https://www.yelp.com/biz/las-tortas-locas-marietta','https://www.yelp.com/biz/smith-and-wollensky-las-vegas-2','https://www.yelp.com/biz/happy-garden-chinese-brighton','https://www.yelp.com/biz/urban-foodie-feed-store-college-park','https://www.yelp.com/biz/the-wolf-oakland','https://www.yelp.com/biz/scuzzis-italian-restaurant-san-antonio-4','https://www.yelp.com/biz/better-gourmet-health-kitchen-staten-island','https://www.yelp.com/biz/the-restaurant-and-cafe-warren','https://www.yelp.com/biz/mcdonalds-houston-214','https://www.yelp.com/biz/pyeong-chang-tofu-house-oakland','https://www.yelp.com/biz/maria-rosa-pizzeria-and-family-restaurant-flemington','https://www.yelp.com/biz/legends-sports-bar-and-grill-roseville-2','https://www.yelp.com/biz/villa-reale-pizzeria-and-restaurant-pittsburgh','https://www.yelp.com/biz/the-terrace-cafe-venice','https://www.yelp.com/biz/the-oval-room-washington-2','https://www.yelp.com/biz/high-point-coal-center','https://www.yelp.com/biz/j-and-s-montebello','https://www.yelp.com/biz/cheers-restaurant-and-bar-fort-lauderdale']
def parse_page(url):
    # url = "https://www.yelp.com/biz/frances-san-francisco"
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    response = requests.get(url, headers=headers, verify=False).text
    parser = html.fromstring(response)
    raw_name = parser.xpath("//h1[contains(@class,'page-title')]//text()")
    raw_claimed = parser.xpath("//span[contains(@class,'claim-status_icon--claimed')]/parent::div/text()")
    raw_reviews = parser.xpath("//div[contains(@class,'biz-main-info')]//span[contains(@class,'review-count rating-qualifier')]//text()")
    raw_category = parser.xpath('//div[contains(@class,"biz-page-header")]//span[@class="category-str-list"]//a/text()')
    hours_table = parser.xpath("//table[contains(@class,'hours-table')]//tr")
    details_table = parser.xpath("//div[@class='short-def-list']//dl")
    raw_map_link = parser.xpath("//a[@class='biz-map-directions']/img/@src")
    raw_phone = parser.xpath(".//span[@class='biz-phone']//text()")
    raw_address = parser.xpath('//div[@class="mapbox-text"]//div[contains(@class,"map-box-address")]//text()')
    raw_wbsite_link = parser.xpath("//span[contains(@class,'biz-website')]/a/@href")
    raw_price_range = parser.xpath("//dd[contains(@class,'price-description')]//text()")
    raw_health_rating = parser.xpath("//dd[contains(@class,'health-score-description')]//text()")
    rating_histogram = parser.xpath("//table[contains(@class,'histogram')]//tr[contains(@class,'histogram_row')]")
    raw_ratings = parser.xpath("//div[contains(@class,'biz-page-header')]//div[contains(@class,'rating')]/@title")
    working_hours = []
    for hours in hours_table:
        raw_day = hours.xpath(".//th//text()")
        raw_timing = hours.xpath("./td//text()")
        day = ''.join(raw_day).strip()
        timing = ''.join(raw_timing).strip()
        working_hours.append({day: timing})
    info = []
    for details in details_table:
        raw_description_key = details.xpath('.//dt//text()')
        raw_description_value = details.xpath('.//dd//text()')
        description_key = ''.join(raw_description_key).strip()
        description_value = ''.join(raw_description_value).strip()
        info.append({description_key: description_value})
    ratings_histogram = []
    for ratings in rating_histogram:
        raw_rating_key = ratings.xpath(".//th//text()")
        raw_rating_value = ratings.xpath(".//td[@class='histogram_count']//text()")
        rating_key = ''.join(raw_rating_key).strip()
        rating_value = ''.join(raw_rating_value).strip()
        ratings_histogram.append({rating_key: rating_value})
    name = ''.join(raw_name).strip()
    phone = ''.join(raw_phone).strip()
    address = ' '.join(' '.join(raw_address).split())
    health_rating = ''.join(raw_health_rating).strip()
    price_range = ''.join(raw_price_range).strip()
    claimed_status = ''.join(raw_claimed).strip()
    reviews = ''.join(raw_reviews).strip()
    category = ','.join(raw_category)
    cleaned_ratings = ''.join(raw_ratings).strip()
    if raw_wbsite_link:
        decoded_raw_website_link = urllib.unquote(raw_wbsite_link[0])
        website = re.findall("biz_redir\?url=(.*)&website_link", decoded_raw_website_link)[0]
    else:
        website = ''
    if raw_map_link:
        decoded_map_url = urllib.unquote(raw_map_link[0])
        map_coordinates = re.findall("center=([+-]?\d+.\d+,[+-]?\d+\.\d+)", decoded_map_url)[0].split(',')
        latitude = map_coordinates[0]
        longitude = map_coordinates[1]
    else:
        latitude = ''
        longitude = ''
    if raw_ratings:
        ratings = re.findall("\d+[.,]?\d+", cleaned_ratings)[0]
    else:
        ratings = 0
    data = {'working_hours': working_hours,
            'info': info,
            'ratings_histogram': ratings_histogram,
            'name': name,
            'phone': phone,
            'ratings': ratings,
            'address': address,
            'health_rating': health_rating,
            'price_range': price_range,
            'claimed_status': claimed_status,
            'reviews': reviews,
            'category': category,
            'website': website,
            'latitude': latitude,
            'longitude': longitude,
            'url': url,
            }
    return data
def parse_reviews(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0 Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0.'}
    response = requests.get(url, headers=headers, verify=False).text
    parser = html.fromstring(response)
    ratings_zipped = []
    reviews = [x for x in parser.xpath("//div[contains(@class,'main-section')]//div[contains(@class,'review-list')]//div[contains(@class,'review')]//div[contains(@class,'review-content')]")]
    for r in reviews:
        date = r.xpath("./div[contains(@class,'biz-rating')]//span[contains(@class,'rating-qualifier')]/text()")[0].strip()
        rating = r.xpath("./div[contains(@class,'biz-rating')]//div[contains(@class,'rating-large')]/@title")[0]
        content = r.xpath("./p")[0].text_content()
        ratings_zipped.append([date, rating, content])
    print(len(ratings_zipped))
    return ratings_zipped
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
def parse_pagination(url):
    print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    response = requests.get(url, headers=headers, verify=False)
    print(response)
    parser = html.fromstring(response.text)
    try:
        results = (int(parser.xpath("//div[contains(@class,'page-of-pages')]//text()")[0].strip().split(' ').pop())) * 20
    except IndexError:
        results = 20
    print(results)
    return results
def get_businesses_data(data):
    businesses, failed_searches = [], []
    start_time = time.time()
    result = {}
    for i, url in enumerate(data):
        print('Starting iteration: ', i)
        result['url'] = url
        pagination = parse_pagination(url)
        print('Pagination: ', pagination)
        info = parse_page(url)
        result['info'] = info
        _reviews = []
        for v in xrange(0, pagination, 20):
            paginated_url = result['url'].split('?')[0] + '?start=' + str(v)
            print('Scraping Reviews: ', paginated_url)
            _reviews += parse_reviews(paginated_url)
            time.sleep(.5)
        result['scraped_reviews'] = _reviews
        result['scraped_reviews_count'] = len(_reviews)
        businesses.append(result)
        print('Success iteration: ', i)
        # print ('Results: ', result)
        print('Num of reviews: ', str(len(_reviews)))
        print('')
        print('Time Elapsed: ', str(time.time() - start_time))
    return businesses
if __name__ == "__main__":
    index = 5
    # 0
    size = 20
    i = index * 20
    chunk = yelp_urls[i:i + size]
    businesses = get_businesses_data(chunk)
    with open('results/run_3/output_{}.json'.format(i), 'w') as f:
        json.dump(businesses, f)

from exceptions import ValueError

You don't need that import at all: the exceptions module only existed in Python 2, and in Python 3 ValueError is a built-in name. Not to mention the fact that you never use ValueError anywhere in your code, so you can simply delete that line.
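A quick way to convince yourself in any Python 3 interpreter (a minimal sketch, independent of the Yelp script):

# ValueError is a builtin in Python 3; nothing to import.
try:
    int("not a number")
except ValueError as exc:
    print("caught:", exc)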

Related

While loop stops working but the process continues running in python

I use a while loop in Python to download several PDF documents listed in a CSV file.
The code runs smoothly, but the loop stops working after a number of iterations (sometimes 100, other times 40 or 140).
Below is the code I use:
import pandas as pd
import os
import urllib
from urllib import request
import requests
import csv
import numpy as np

df = pd.read_csv('Linklist.csv', sep=';')  # can also index sheet by name or fetch all sheets
df.head()  # get relevant columns
url_list = df['URL'].tolist()  # column with links
name_list = df['Name'].tolist()  # column with name
name_list_2 = df['Year'].to_list()  # column with second identifier, here a year for example
Year_date = []
for element in name_list_2:
    Year_date.append(str(element))
max_length = len(url_list)
i = 0
f = open('results.csv', 'w')
writer = csv.writer(f)
while i <= max_length - 1:
    response = requests.get(url_list[i])
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15')]
    if response.status_code != 200:
        i += 1
        a = str(response.status_code)
        write_a = (name_list[i], Year_date[i], a)
        writer.writerow(write_a)
        print(name_list[i] + ' ' + Year_date[i] + ' ' + a)
    else:
        urllib.request.install_opener(opener)
        request.urlretrieve(url_list[i], '/targetpath/' + name_list[i] + Year_date[i] + '.pdf')  # .pdf if it is a pdf doc you want to download
        b = str(response.status_code)
        write_b = (name_list[i], Year_date[i], b)
        writer.writerow(write_b)
        print(name_list[i] + ' ' + Year_date[i] + ' ' + b)
        i += 1
f.close()
The information from @barmar and @ogdenkev is correct: I needed to integrate a timeout component!
The working code now looks like this (showing just the part I changed):
DEFAULT_TIMEOUT = 180
old_send = requests.Session.send

def new_send(*args, **kwargs):
    if kwargs.get("timeout", None) is None:
        kwargs["timeout"] = DEFAULT_TIMEOUT
    return old_send(*args, **kwargs)

requests.Session.send = new_send

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15'}
while i <= max_length - 1:
    try:
        response = requests.get(url_list[i], headers=headers)
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15')]
        if response.status_code != 200:
            a = str(response.status_code)
            write_a = (name_list[i], Year_date[i], a)
            writer.writerow(write_a)
            print(name_list[i] + ' ' + Year_date[i] + ' ' + a)
            print(i)
            i += 1
        else:
            urllib.request.install_opener(opener)
            request.urlretrieve(url_list[i], '/path/' + name_list[i] + Year_date[i] + '.pdf')  # .pdf if it is a pdf doc you want to download
            b = str(response.status_code)
            write_b = (name_list[i], Year_date[i], b)
            writer.writerow(write_b)
            print(name_list[i] + ' ' + Year_date[i] + ' ' + b)
            print(i)
            i += 1
    except requests.exceptions.RequestException as e:
        c = 'Timeout'
        write_a = (name_list[i], Year_date[i], c)
        writer.writerow(write_a)
        print(name_list[i] + ' ' + Year_date[i] + ' ' + c)
        print(i)
        i += 1
f.close()
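If you would rather not monkey-patch requests.Session.send, the same effect can be had by passing the timeout on each call (a sketch using the 180-second value from above):

response = requests.get(url_list[i], headers=headers, timeout=180)
# A hung connection now raises requests.exceptions.ReadTimeout, which the
# except requests.exceptions.RequestException branch above already catches.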

Program runs only with None results

My problem is that my program only ever produces None results. I suspect there is a problem with the data parameter in my program.
from lxml import html
import requests

etree = html.etree

class News(object):
    def __init__(self):
        self.url = 'https://www.chinatimes.com/newspapers/260118'
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"}

    def get_data(self, url):
        response = requests.get(url, headers=self.headers)
        return response.content

    def parse_data(self, data):
        # create the element object
        data = data.decode()
        html = etree.HTML(data)
        el_list = html.xpath('/html/body/div[2]/div/div[2]/div/section/ul/li/div/div/div/h3/a/font')
        data_list = []
        for el in el_list:
            temp = {}
            temp['title'] = el
            temp['link'] = 'https://www.chinatimes.com' + el.xpath("./@href")[0]
            data_list.append(temp)
        try:
            # get the URL of the next page
            next_url = 'https://www.chinatimes.com' + html.xpath('/html/body/div[2]/div/div[2]/div/section/nav/ul/li[7]/a/@href')[0]
        except:
            next_url = None
        return data_list, next_url

    def save_data(self, data_list):
        for data in data_list:
            print(data)

    def run(self):
        # url
        next_url = self.url
        while True:
            data = self.get_data(next_url)
            data_list, next_url = self.parse_data(data)
            self.save_data(data_list)
            print(next_url)
            if next_url == None:
                break

if __name__ == '__main__':
    news = News()
    news.run()
I use Google Chrome, and my XPath should be correct. I think there is a problem with my data parameter, but I am not sure about it. I hope someone can help me look at it, thank you very much. (When I submitted the question, the system kept saying that most of what I submitted was code, so I had to pad it out with some words in order to post it.)
I thought about it again, and it would be better to start from here. Maybe my XPath is correct, but I don't know how to solve the problem at this line: data = data.decode()
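A minimal sketch of one thing to try at that line, assuming the problem really does come from the hand-rolled decode: bytes.decode() defaults to UTF-8 and will mangle or reject pages served in another charset, so letting Requests detect the encoding is safer.

response = requests.get(url, headers=self.headers)
response.encoding = response.apparent_encoding  # let requests guess the real charset
data = response.text  # already a decoded str; pass it straight to etree.HTML()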

Crawler script runs without error, but there's no output excel as I expected

I tried to crawl some housing information from a Chinese housing website. The code raises no error when I run it; however, there is no output file when the run completes.
import requests
from bs4 import BeautifulSoup
import sys
import os
import time
import pandas as pd
import numpy as np
from parsel import Selector
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.7 Safari/537.36'
}

def catchHouseList(url):
    resp = requests.get(url, headers=headers, stream=True)
    if resp.status_code == 200:
        reg = re.compile('<li.*?class="clear">.*?<a.*?class="img.*?".*?href="(.*?)"')
        urls = re.findall(reg, resp.text)
        return urls
    return []

def catchHouseDetail(url):
    resp = requests.get(url, headers=headers)
    print(url)
    if resp.status_code == 200:
        info = {}
        soup = BeautifulSoup(resp.text, 'html.parser')
        info['Title'] = soup.select('.main')[0].text
        info['Total_Price'] = soup.select('.total')[0].text
        info['Unit_Price'] = soup.select('.unit')[0].text
        info['Price_per_square'] = soup.select('.unitPriceValue')[0].text
        # p = soup.select('.tax')
        # info['Reference_price'] = soup.select('.tax')[0].text
        info['Built_time'] = soup.select('.subInfo')[2].text
        info['Place_Name'] = soup.select('.info')[0].text
        info['Area'] = soup.select('.info a')[0].text + ':' + soup.select('.info a')[1].text
        info['Lianjia_number'] = str(url)[34:].rsplit('.html')[0]
        info['flooring_plan'] = str(soup.select('.content')[2].select('.label')[0].next_sibling)
        info['floor'] = soup.select('.content')[2].select('.label')[1].next_sibling
        info['Area_Size'] = soup.select('.content')[2].select('.label')[2].next_sibling
        info['Flooring_structure'] = soup.select('.content')[2].select('.label')[3].next_sibling
        info['Inner_Area'] = soup.select('.content')[2].select('.label')[4].next_sibling
        info['Building_Category'] = soup.select('.content')[2].select('.label')[5].next_sibling
        info['House_Direction'] = soup.select('.content')[2].select('.label')[6].next_sibling
        info['Building_Structure'] = soup.select('.content')[2].select('.label')[7].next_sibling
        info['Decoration'] = soup.select('.content')[2].select('.label')[8].next_sibling
        info['Stair_Number'] = soup.select('.content')[2].select('.label')[9].next_sibling
        info['Heating'] = soup.select('.content')[2].select('.label')[10].next_sibling
        info['Elevator'] = soup.select('.content')[2].select('.label')[11].next_sibling
        # info['Aseest_Year'] = str(soup.select('.content')[2].select('.label')[12].next_sibling)
        return info
    pass

def appendToXlsx(info):
    fileName = './second_hand_houses.xlsx'
    dfNew = pd.DataFrame([info])
    if os.path.exists(fileName):
        sheet = pd.read_excel(fileName)
        dfOld = pd.DataFrame(sheet)
        df = pd.concat([dfOld, dfNew])
        df.to_excel(fileName)
    else:
        dfNew.to_excel(fileName)

def catch():
    pages = ['https://zs.lianjia.com/ershoufang/guzhenzhen/pg{}/'.format(x) for x in range(1, 21)]
    for page in pages:
        print(page)
        houseListURLs = catchHouseList(page)
        for houseDetailUrl in houseListURLs:
            try:
                info = catchHouseDetail(houseDetailUrl)
                appendToXlsx(info)
            except:
                pass
        time.sleep(2)
    pass

if __name__ == '__main__':
    catch()
I expected an Excel file as output, but there is nothing in the end; the run only reports "Process finished with exit code 0".
Here's one of your problem areas, with a little rewrite to help you see it. You were returning an empty list when the status code was anything other than 200, without any warning or explanation. The rest of your script requires a list to continue running, so when you return an empty list, it exits cleanly.
Now, with the rewrite below, this function will return None when the server response isn't 200, and a TypeError will then be raised in your catch() function, which will require further error handling.
def catchHouseList(url):
    try:
        resp = requests.get(url, headers=headers, stream=True)
        if resp.status_code == 200:
            reg = re.compile(
                '<li.*?class="clear">.*?<a.*?class="img.*?".*?href="(.*?)"')
            urls = re.findall(reg, resp.text)
            return urls
        else:
            print('catchHouseList response code:', resp.status_code)
    except Exception as e:
        print('catchHouseList:', e)
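In catch(), you would then need to guard against that None before looping over the result; a minimal sketch of one way to handle it:

houseListURLs = catchHouseList(page)
if not houseListURLs:  # None (error or non-200) or an empty list: skip this page
    continue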

How can I download high resolution images from google use python + selenium + phantomJS

I want to fetch more than 100 high-resolution images from Google, using Python 2.7 + Selenium + PhantomJS.
But when I do as the instructions say, I only get a web page of small images, and I can't find any link that points directly to the high-resolution pictures. How can I fix this?
My code is below.
from bs4 import BeautifulSoup
from selenium import webdriver
import json  # needed for json.loads() below; missing from the original listing
import time

class ImgCrawler:
    def __init__(self, searchlink=None):
        self.link = searchlink
        self.soupheader = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
        self.scrolldown = None
        self.jsdriver = None

    def getPhantomJSDriver(self):
        self.jsdriver = webdriver.PhantomJS()
        self.jsdriver.get(self.link)

    def scrollDownUsePhatomJS(self, scrolltimes=1, sleeptime=10):
        for i in range(scrolltimes):
            self.jsdriver.execute_script('window.scrollTo(0,document.body.scrollHeight);')
            time.sleep(sleeptime)

    def getSoup(self, parser=None):
        print 'a', self.jsdriver.page_source
        return BeautifulSoup(self.jsdriver.page_source, parser)

    def getActualUrl(self, soup=None, flag=None, classflag=None, jsonflaglink=None, jsonflagtype=None):
        actualurl = []
        for a in soup.find_all(flag, {"class": classflag}):
            link = json.loads(a.text)[jsonflaglink]
            filetype = json.loads(a.text)[jsonflagtype]
            detailurl = link + u'.' + filetype
            actualurl.append(detailurl)
        return actualurl

if __name__ == '__main__':
    search_url = "https://www.google.com.hk/search?safe=strict&hl=zh-CN&site=imghp&tbm=isch&source=hp&biw=&bih=&btnG=Google+%E6%90%9C%E7%B4%A2&q="
    queryword = raw_input()
    query = queryword.split()
    query = '+'.join(query)
    weblink = search_url + query
    img = ImgCrawler(weblink)
    img.getPhantomJSDriver()
    img.scrollDownUsePhatomJS(2, 5)
    soup = img.getSoup('html.parser')
    print weblink
    print soup
    actualurllist = img.getActualUrl(soup, 'div', 'rg_meta', 'ou', 'ity')
    print len(actualurllist)
I tried for a long time to use PhantomJS but ended up using Chrome, which I know is not what you asked for, but it works; I could not get it to work with PhantomJS.
First get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads; you can also use a headless version of Chrome, "Chrome Canary", if you are on Windows.
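If you want Chrome itself to run without a visible window, something like this works too (a sketch; the --headless flag and the chrome_options parameter name are assumptions about your Chrome and Selenium versions):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(chrome_options=options)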
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import re
import urlparse

class ImgCrawler:
    def __init__(self, searchlink=None):
        self.link = searchlink
        self.soupheader = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
        self.scrolldown = None
        self.jsdriver = None

    def getPhantomJSDriver(self):
        self.jsdriver = webdriver.Chrome()
        self.jsdriver.get(self.link)

    def scrollDownUsePhatomJS(self, scrolltimes=1, sleeptime=10):
        for i in range(scrolltimes):
            self.jsdriver.execute_script('window.scrollTo(0,document.body.scrollHeight);')
            time.sleep(sleeptime)

    def getSoup(self, parser=None):
        print 'a', self.jsdriver.page_source
        return BeautifulSoup(self.jsdriver.page_source, parser)

    def getActualUrl(self, soup=None):
        actualurl = []
        r = re.compile(r"/imgres\?imgurl=")
        for a in soup.find_all('a', href=r):
            parsed = urlparse.urlparse(a['href'])
            url = urlparse.parse_qs(parsed.query)['imgurl']
            actualurl.append(url)
            print url
        return actualurl

if __name__ == '__main__':
    search_url = "https://www.google.com.hk/search?safe=strict&hl=zh-CN&site=imghp&tbm=isch&source=hp&biw=&bih=&btnG=Google+%E6%90%9C%E7%B4%A2&q="
    queryword = raw_input()
    query = queryword.split()
    query = '+'.join(query)
    weblink = search_url + query
    img = ImgCrawler(weblink)
    img.getPhantomJSDriver()
    img.scrollDownUsePhatomJS(2, 5)
    soup = img.getSoup('html.parser')
    print weblink
    print soup
    actualurllist = img.getActualUrl(soup)
    print len(actualurllist)
I changed getActualUrl() to get the image URL from an "a" element whose "href" attribute starts with "/imgres?imgurl=".
Output (when "hazard" is typed into the terminal):
[u'https://static.independent.co.uk/s3fs-public/styles/article_small/public/thumbnails/image/2016/12/26/16/eden-hazard.jpg']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b1/EdenHazardDecember_2016.jpg/200px-EdenHazardDecember_2016.jpg']
[u'http://a.espncdn.com/combiner/i/?img=/photo/2016/1227/r166293_1296x729_16-9.jpg&w=738&site=espnfc']
[u'https://platform-static-files.s3.amazonaws.com/premierleague/photos/players/250x250/p42786.png']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Eden_Hazard_-_DK-Chel15_%286%29.jpg/150px-Eden_Hazard_-_DK-Chel15_%286%29.jpg']
[u'http://a.espncdn.com/combiner/i/?img=/photo/2017/0117/r172004_1296x729_16-9.jpg&w=738&site=espnfc']
[u'http://images.performgroup.com/di/library/GOAL/98/c0/eden-hazard-chelsea_1eohde060wvft1elcskrgihxq3.jpg?t=-1500575837&w=620&h=430']
[u'http://e1.365dm.com/17/03/16-9/20/skysports-eden-hazard-chelsea_3918835.jpg?20170331154242']
[u'http://a.espncdn.com/combiner/i/?img=/photo/2017/0402/r196036_1296x729_16-9.jpg&w=738&site=espnfc']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/04/nintchdbpict000316361045.jpg?strip=all&w=670&quality=100']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2017/02/10/14/eden-hazard1.jpg']
[u'http://s.newsweek.com/sites/www.newsweek.com/files/2016/11/07/eden-hazard.jpg']
[u'http://www.newstube24.com/wp-content/uploads/2017/06/haz.jpg']
[u'http://images.performgroup.com/di/library/GOAL/17/b1/eden-hazard_68ypnelg3lfd14oxkffztftt6.png?t=-1802977526&w=620&h=430']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/e/eb/DK-Chel15_%288%29.jpg/220px-DK-Chel15_%288%29.jpg']
[u'http://images.performgroup.com/di/library/omnisport/50/3f/hazard-cropped_3y08vc3ejpua168e9mgvu4mwc.jpg?t=-930203025&w=620&h=430']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict000291621611-e1490777105213.jpg?strip=all&w=745&quality=100']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2016/01/14/14/Eden-Hazard.jpg']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/5/56/Eden_Hazard%2713-14.JPG/150px-Eden_Hazard%2713-14.JPG']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict000296311943-e1490777241155.jpg?strip=all&w=596&quality=100']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2016/01/27/11/hazard.jpg']
[u'http://www.newzimbabwe.com/FCKEditor_Images/Eden-Hazard-896286.jpg']
[u'http://images.performgroup.com/di/library/GOAL/9c/93/eden-hazard_d4lbib7wdagw1hp2e5gnyov0k.jpg?t=-1763574189&w=620&h=430']
[u'http://www.guoguiyan.com/data/out/94/69914569-hazard-wallpapers.jpg']
[u'http://static.guim.co.uk/sys-images/Football/Pix/pictures/2015/4/16/1429206099512/Eden-Hazard-009.jpg']
[u'https://metrouk2.files.wordpress.com/2017/04/pri_37621532.jpg?w=620&h=406&crop=1']
[u'http://alummata.com/wp-content/uploads/2016/04/Hazard.jpg']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Hazard_vs_Norwich_%28crop%29.jpg/150px-Hazard_vs_Norwich_%28crop%29.jpg']
[u'http://i.dailymail.co.uk/i/pix/2016/11/06/20/3A185FB800000578-3910886-image-a-46_1478462522592.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/467822629-465742.jpg']
[u'http://i.dailymail.co.uk/i/pix/2015/10/17/18/2D81CB1D00000578-0-image-a-37_1445102645249.jpg']
[u'http://images.performgroup.com/di/library/GOAL_INTERNATIONAL/27/ce/eden-hazard_1kepw6rvweted1hpfmp5xwd5cs.jpg?t=-228379025&w=620&h=430']
[u'http://img.skysports.com/16/12/768x432/skysports-chelsea-manchester-city-eden-hazard_3845204.jpg?20161203162258']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict0003089026162.jpg?strip=all&w=960&quality=100']
[u'http://www.chelseafc.com/content/cfc/en/homepage/news/boilerplate-config/latest-news/2016/05/hazard--rediscovering-our-form.img.png']
[u'http://images.performgroup.com/di/library/omnisport/b5/98/hazard-cropped_172u0n8wx4j071cvgs1n3yycvw.jpg?t=2030908123&w=620&h=430']
[u'http://images.indianexpress.com/2016/05/eden-hazard-m.jpg']
[u'http://i2.mirror.co.uk/incoming/article9755579.ece/ALTERNATES/s615/PAY-Chelsea-v-Arsenal-Premier-League.jpg']
[u'http://www.chelseafc.com/content/cfc/en/homepage/news/boilerplate-config/latest-news/2017/06/hazard-injury-update.img.png']
[u'http://futhead.cursecdn.com/static/img/fm/17/players/183277_HAZARDCAM7.png']
[u'http://images.performgroup.com/di/library/GOAL/4d/6/eden-hazard-chelsea-06032017_enh1ll3uadj01ocstyopie9e4.jpg?t=-1521106510&w=620&h=430']
[u'http://images.performgroup.com/di/library/GOAL/34/8/eden-hazard-chelsea-southampton_1oca1rpy37gmn1ldmqvytti3k4.jpg?t=-1501721805&w=620&h=430']
[u'http://media.gettyimages.com/photos/eden-hazard-of-chelsea-celebrates-scoring-his-sides-third-goal-during-picture-id617452212?s=612x612']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/secondary/Eden-Hazard-889782.jpg']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2015/10/19/16/Hazard.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict000307050894.jpg?strip=all&w=660&quality=100']
[u'http://e1.365dm.com/16/11/16-9/20/skysports-eden-hazard-chelsea-football_3833122.jpg?20161116153005']
[u'http://thumb.besoccer.com/media/img_news/morata-y-hazard--besoccer.jpg']
[u'https://static.independent.co.uk/s3fs-public/styles/story_medium/public/thumbnails/image/2017/03/13/21/10-hazard.jpg']
[u'https://static.independent.co.uk/s3fs-public/styles/story_medium/public/thumbnails/image/2016/12/27/13/hazard.jpg']
[u'http://images.performgroup.com/di/library/GOAL/63/2a/eden-hazard-chelsea_15ggj1j39rmky1c5a20oxt3tly.jpg?t=1297762370']
[u'http://i1.mirror.co.uk/incoming/article9755531.ece/ALTERNATES/s615b/Chelsea-v-Arsenal-Premier-League.jpg']
[u'http://cf.c.ooyala.com/l2YmpyYjE6yvLSxGiEebNMr3N1ANS1Xc/O0cEsGv5RdudyPNn4xMDoxOjBnO_4SLA']
[u'http://media.gettyimages.com/photos/eden-hazard-of-chelsea-celebrates-scoring-his-sides-third-goal-during-picture-id617452006?s=612x612']
[u'http://a.espncdn.com/combiner/i/?img=/photo/2017/0413/r199412_2_1296x729_16-9.jpg&w=738&site=espnfc']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/04/nintchdbpict000318703991-e1493109803795.jpg?strip=all&w=960&quality=100']
[u'https://static.independent.co.uk/s3fs-public/styles/story_medium/public/thumbnails/image/2016/11/18/14/hazard-award.jpg']
[u'http://static.goal.com/2477200/2477282_heroa.jpg']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2016/12/16/14/eden-hazard.jpg']
[u'http://www.guoguiyan.com/data/out/94/69979129-hazard-wallpapers.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2016/11/nintchdbpict0002769424741.jpg?w=960&strip=all']
[u'http://v.uecdn.es/p/110/thumbnail/entry_id/0_ofavjqr8/width/660/cache_st/20170327164629/type/2/bgcolor/000000/0_ofavjqr8.jpg']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2017/02/07/10/edenhazard.jpg']
[u'http://theworldgame.sbs.com.au/sites/sbs.com.au.theworldgame/files/styles/full/public/images/e/d/eden-hazard-cropped_g6m28ldoc0b41p3f5sp2vlflt.jpg?itok=XW5M7QEA']
[u'http://e0.365dm.com/17/03/16-9/20/skysports-eden-hazard-chelsea_3909051.jpg?20170314181126']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/secondary/Chelsea-Hazard-goals-886084.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/04/nintchdbpict000319343894.jpg?strip=all&w=960&quality=100']
[u'https://www.footyrenders.com/render/Eden-Hazard-PL.png']
[u'http://media.gettyimages.com/photos/eden-hazard-of-chelsea-celebrates-as-he-scores-their-first-goal-the-picture-id672946758?s=612x612']
[u'https://pbs.twimg.com/profile_images/791664465729163264/XbCVl6BF.jpg']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/Eden_Hazard_2011.jpg/170px-Eden_Hazard_2011.jpg']
[u'http://s.newsweek.com/sites/www.newsweek.com/files/2016/02/01/guus-hiddink-says-eden-hazard-could-leave-chelsea..jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2016/06/chelsea_hazard_mobile_top.jpg?strip=all&w=750&h=352&crop=1']
[u'http://cdn.images.dailystar.co.uk/dynamic/58/photos/735000/Eden-Hazard-887735.jpg']
[u'http://i.telegraph.co.uk/multimedia/archive/03580/Hazard_Real_copy_3580583b.jpg']
[u'https://premierleague-static-files.s3.amazonaws.com/premierleague/photo/2017/05/21/47a1f452-43e4-4215-a5b8-5043c1e12a07/686302908.jpg']
[u'http://www.chelseafc.com/content/cfc/en/homepage/news/boilerplate-config/latest-news/2017/06/hazard-s-highlights.img.png']
[u'http://i.dailymail.co.uk/i/pix/2016/12/14/15/3B45B4D300000578-4032306-Hazard_PFA_Player_of_the_Year_in_2015_has_rediscovered_his_form_-a-6_1481728291902.jpg']
[u'https://img.rasset.ie/000d5137-800.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/Eden-Hazard-665260.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/Eden-Hazard-659164.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/secondary/Hazard-scored-Chelsea-s-third-goal-against-Tottenham-909804.jpg']
[u'http://cdn.images.dailystar.co.uk/dynamic/58/photos/237000/Eden-Hazard-887237.jpg']
[u'http://a.espncdn.com/combiner/i/?img=/media/motion/ESPNi/2017/0405/int_170405_Hazard_the_successor_to_Ronaldo_at_Real/int_170405_Hazard_the_successor_to_Ronaldo_at_Real.jpg&w=738&site=espnfc']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/Eden-Hazard-721522.jpg']
[u'http://media.gettyimages.com/photos/eden-hazard-of-chelsea-celebrates-with-teammates-after-scoring-his-picture-id633776492?s=612x612']
[u'http://betinmalta.com/wp-content/uploads/2017/05/hazard.jpg']
[u'http://cdn.images.dailystar.co.uk/dynamic/58/photos/708000/Eden-Hazard-Chelsea-712708.jpg']
[u'http://images.performgroup.com/di/library/omnisport/c9/4a/eden-hazard-cropped_12u5irb6bkze1cue2wpjzxa44.jpg?t=-2084914038&w=620&h=430']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/06/nintchdbpict0003291256741.jpg?strip=all&w=714&quality=100']
[u'https://premierleague-static-files.s3.amazonaws.com/premierleague/photo/2017/03/10/f97d36aa-1eef-4a78-996f-63d543c79efc/700017169TS004_Eden_Hazard_.JPG']
[u'https://s-media-cache-ak0.pinimg.com/736x/f0/01/17/f001178defb2b3be3cffb5e9b792748b--eden-hazard-liverpool-england.jpg']
[u'http://i4.mirror.co.uk/incoming/article9898829.ece/ALTERNATES/s615b/hazard-2.jpg']
[u'http://images.performgroup.com/di/library/GOAL/24/76/eden-hazard-and-lionel-messi_kamv8simc20x1p2i2fcf7lllw.png?t=421166242&w=620&h=430']
[u'https://metrouk2.files.wordpress.com/2017/03/gettyimages-618471206.jpg?w=748&h=498&crop=1']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/Hazard-Chelsea-658138.jpg']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2017/04/23/17/hazard.jpg']
[u'http://e0.365dm.com/15/10/16-9/20/eden-hazard-comparison-chelsea_3365521.jpg?20151018152317']
[u'http://cdn.images.express.co.uk/img/dynamic/galleries/x701/231048.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/galleries/x701/102742.jpg']
[u'https://i.ytimg.com/vi/GWhVkFTe_BY/maxresdefault.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/05/nintchdbpict000319343748-e1496260888520.jpg?strip=all&w=960&quality=100']
[u'https://metrouk2.files.wordpress.com/2017/06/689818982.jpg?w=748&h=498&crop=1']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict000308902470.jpg?strip=all&w=960&quality=100']
[u'https://www.thesun.co.uk/wp-content/uploads/2016/12/nintchdbpict000289569836.jpg?w=960&strip=all']
[u'https://i.ytimg.com/vi/zZ9stt70_vU/maxresdefault.jpg']
[u'https://upload.wikimedia.org/wikipedia/commons/2/2c/Kylian_Hazard_%28cropped%29.jpg']
[u'http://e00-marca.uecdn.es/assets/multimedia/imagenes/2017/03/26/14905092504845.jpg']
[u'http://images.performgroup.com/di/library/omnisport/ba/47/eden-hazard-cropped_rccnpv1me3v51kqpnj5ak4nko.jpg?t=1222186324&w=620&h=430']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/secondary/Eden-Hazard-922505.jpg']
[u'https://s-media-cache-ak0.pinimg.com/736x/48/ce/4c/48ce4c478d8b06dccacce352d9d4bdc2--eden-hazard-pogba.jpg']
[u'http://www.telegraph.co.uk/content/dam/football/2016/10/23/111897755_Editorial_use_only_No_merchandising_For_Football_images_FA_and_Premier_League_restrict-large_trans_NvBQzQNjv4BqqVzuuqpFlyLIwiB6NTmJwfSVWeZ_vEN7c6bHu2jJnT8.jpg']
[u'https://metrouk2.files.wordpress.com/2017/06/686902184.jpg?w=748&h=652&crop=1']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict0003086723271-e1489501904590.jpg?strip=all&w=960&quality=100']
[u'http://e2.365dm.com/17/01/16-9/20/skysports-chelsea-manchester-city-eden-hazard_3862340.jpg?20170406190414']
[u'http://www.telegraph.co.uk/content/dam/football/2017/05/26/TELEMMGLPICT000129483487-large_trans_NvBQzQNjv4BqajCpFXsei0OXjDFGPZkcdJOkVdu-K0ystYH4SV7DHn8.jpeg']
[u'https://i.ytimg.com/vi/FFE4Ea437ks/maxresdefault.jpg']
[u'https://i1.wp.com/www.vanguardngr.com/wp-content/uploads/2017/03/Hazard-madrid.png?resize=350%2C200']
[u'http://china.chelseafc.com/content/cfc/zh/homepage/teams/first-team/eden-hazard/summary/_jcr_content/tabparmain/box/box/textimage/image.img.jpg/1496846329140.jpg']
[u'http://static.goal.com/198700/198707_news.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/05/nintchdbpict000319357531.jpg?strip=all&w=960&quality=100']
[u'http://media.gettyimages.com/photos/eden-hazard-of-chelsea-poses-with-the-premier-league-trophy-after-the-picture-id686826640?s=612x612']
[u'http://cf.c.ooyala.com/t3dXdzYjE6VJktcnKdi7F2205I_mSSKQ/eWNh-8akTAF2kj8X4xMDoxOjBnO_4SLA']
[u'http://c.smimg.net/16/39/300x225/eden-hazard.jpg']
[u'http://www.whatfootballersearn.com/wp-content/uploads/Eden-Hazard.jpg']
[u'http://cdn.images.dailystar.co.uk/dynamic/58/photos/328000/Eden-Hazard-437328.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/secondary/Eden-Hazard-Chelsea-882846.jpg']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/Chelsea-star-Eden-Hazard-741161.jpg']
[u'https://talksport.com/sites/default/files/styles/taxonomy-img/public/field/image/201703/hazard_0.jpg']
[u'http://i.dailymail.co.uk/i/pix/2016/08/28/21/37A101A700000578-3762573-image-a-19_1472417354943.jpg']
[u'http://www.telegraph.co.uk/content/dam/football/2016/07/27/87650659-edenhazard-sport-large_trans_NvBQzQNjv4BqqVzuuqpFlyLIwiB6NTmJwfSVWeZ_vEN7c6bHu2jJnT8.jpg']
[u'http://cdn.images.dailystar.co.uk/dynamic/58/photos/165000/620x/Eden-Hazard-598896.jpg']
[u'http://i.dailymail.co.uk/i/pix/2016/05/04/21/33C8A26600000578-0-image-a-19_1462392130112.jpg']
[u'https://ichef-1.bbci.co.uk/news/660/cpsprodpb/13AA1/production/_96354508_595836d4-f21a-419b-95cb-37a65204f6eb.jpg']
[u'https://premierleague-static-files.s3.amazonaws.com/premierleague/photo/2016/11/30/1eb421ae-b210-4a01-95bb-36f509826cc1/Debruyne_v_Hazard.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/03/nintchdbpict000309639487-e1490040388851.jpg?strip=all&w=739&quality=100']
[u'http://static.goal.com/3311200/3311292_heroa.jpg']
[u'http://i3.mirror.co.uk/incoming/article7986781.ece/ALTERNATES/s615b/Hazard-and-son.jpg']
[u'http://a.espncdn.com/combiner/i/?img=/photo/2016/0916/r126535_1296x729_16-9.jpg&w=738&site=espnfc']
[u'http://www.chelseafc.com/content/cfc/en/homepage/news/boilerplate-config/latest-news/2017/03/hazard-score-is-number-one-.img.png']
[u'https://static.independent.co.uk/s3fs-public/thumbnails/image/2015/03/17/13/eden-hazard.jpg']
[u'https://metrouk2.files.wordpress.com/2017/05/680506564.jpg?w=748&h=457&crop=1']
[u'http://media.gettyimages.com/photos/eden-hazard-of-chelsea-celebrates-with-diego-costa-of-chelsea-after-picture-id671559962?s=612x612']
[u'http://e0.365dm.com/17/05/16-9/20/skysports-eden-hazard-chelsea_3965489.jpg?20170529101357']
[u'https://s-media-cache-ak0.pinimg.com/736x/e0/80/0e/e0800e380ef363594fb292969b7c5b64--eden-hazard-chelsea-soccer.jpg']
[u'http://cdn-football365.365.co.za/content/uploads/2016/12/GettyImages.630542828.jpg']
[u'http://i.dailymail.co.uk/i/pix/2016/07/16/19/340E4A9A00000578-3693637-image-a-84_1468694248523.jpg']
[u'http://www.squawka.com/news/wp-content/uploads/2017/01/hazard-chelsea-e1485528066609.jpg']
[u'http://www.guoguiyan.com/data/out/94/68880901-hazard-wallpapers.jpg']
[u'http://www.telegraph.co.uk/content/dam/football/2017/03/12/JS122962983_EHazDavid-Rose-large_trans_NvBQzQNjv4BqtA9hvt4yaDuJhaJG2frTIUNrh1MdssoHpGF6OIxC49c.jpg']
[u'http://images.performgroup.com/di/library/GOAL/17/25/eden-hazard-chelsea_1dsnlf2z113cx10nxvp9ydudcz.jpg?t=2008335075']
[u'http://www.telegraph.co.uk/content/dam/football/2016/12/05/115182685-eden-hazard-sport-large_trans_NvBQzQNjv4BqA7a2BP2KFPtZUOepzpZgXISdNn8DgVUcalGVREaviFE.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/04/sport-preview-morata-hazard.jpg?strip=all&quality=100&w=750&h=500&crop=1']
[u'http://thumb.besoccer.com/media/img_news/eden-hazard--futbolista-del-chelsea--chelseafc.jpg']
[u'http://i4.mirror.co.uk/incoming/article7374471.ece/ALTERNATES/s615/Chelsea-Training-Session.jpg']
[u'https://metrouk2.files.wordpress.com/2017/04/671549404.jpg?w=748&h=532&crop=1']
[u'https://metrouk2.files.wordpress.com/2016/02/462363538.jpg?w=748&h=563&crop=1']
[u'https://metrouk2.files.wordpress.com/2017/05/6834221661.jpg?w=748&h=507&crop=1']
[u'http://cdn.images.express.co.uk/img/dynamic/67/590x/Chelsea-star-Eden-Hazard-739447.jpg']
[u'http://cdn.quotesgram.com/img/21/41/114220036-24CA6E4700000578-2916442-Eden_Hazard_has_been_instrumental_for_Chelsea_this_season_as_the-a-7_1421682779132.jpg']
[u'http://i.dailymail.co.uk/i/pix/2016/09/29/09/38E4401500000578-3813294-image-a-1_1475138637248.jpg']
[u'http://healthyceleb.com/wp-content/uploads/2016/04/Eden-Hazard-match-between-Chelsea-Milton-Keynes-Dons-January-2016.jpg']
[u'https://talksport.com/sites/default/files/styles/taxonomy-img/public/field/image/201704/gettyimages-663029916.jpg']
[u'https://upload.wikimedia.org/wikipedia/commons/thumb/2/29/Thorgan_Hazard_2014.jpg/220px-Thorgan_Hazard_2014.jpg']
[u'http://cdn.images.dailystar.co.uk/dynamic/58/photos/91000/620x/Eden-Hazard-632850.jpg']
[u'http://i4.mirror.co.uk/incoming/article7531141.ece/ALTERNATES/s615/A-dejected-looking-Eden-Hazard.jpg']
[u'https://www.thesun.co.uk/wp-content/uploads/2017/05/nintchdbpict000322464448-e1494602676644.jpg?strip=all&w=960&quality=100']
[u'http://images.performgroup.com/di/library/GOAL_INTERNATIONAL/76/92/chelsea-bournemouth-eden-hazard_148j4p900kzba159diso6ewwvo.jpg?t=1004329665&w=620&h=430']
[u'https://images.cdn.fourfourtwo.com/sites/fourfourtwo.com/files/styles/inline-image/public/hazard3.jpg?itok=ap0DtuZx']
[u'https://talksport.com/sites/default/files/styles/taxonomy-img/public/field/image/201707/hazard.jpg']
...
[u'http://a.espncdn.com/combiner/i/?img=/photo/2017/0330/r195127_1296x729_16-9.jpg&w=738&site=espnfc']
299

Python + lxml + etree Encoding issue

I'm trying to parse some pages by using this code:
import urllib.request
import requests
from lxml import etree

s = requests.session()
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
})
results = open("res.txt", "w")
for i in range(510077, 2780673):
    results = open("res.txt", "a")
    print(i)
    url = "url" + str(i) + "&print=true"
    try:
        content = s.get(url).text
        tree = etree.HTML(content)
        a = str(tree.xpath("//*[@class='prob_nums']")[0].text)
        b = etree.tostring(tree.xpath("//*[@class='pbody']")[0])
        c = etree.tostring(tree.xpath("//*[@class='nobreak solution']")[0])
        results.writelines("%s %s %s" % (a, b, c))
        results.close()
    except Exception:
        print("error")
But I have a problem with the output (fragment):
<p class="left_margin">На доске на­пи­са
How can I convert these symbols to normal text? Thank you.
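No answer is attached here, but one likely cause (an assumption on my part) is that etree.tostring() returns encoded bytes by default, so %-formatting those bytes into a string yields escaped markup. Asking lxml for text output directly avoids the round-trip:

# encoding="unicode" makes etree.tostring() return a Python str instead of bytes
b = etree.tostring(tree.xpath("//*[@class='pbody']")[0], encoding="unicode")
c = etree.tostring(tree.xpath("//*[@class='nobreak solution']")[0], encoding="unicode")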
