I've been working on a program over the last few days that downloads a range of pages from the webcomic Homestuck. I've created a working version in Python 3, but it is horribly inefficient. Can anyone see ways to improve and shorten this code?
import urllib.request

range1 = int(input("Enter the 1st page you want: "))
range2 = int(input("Enter the last page you want: ")) + 1
current = range1 + 1900
final = range2 + 1900
page = ''
nextPage = ''
while current != final:
    page = str(current)
    nextPage = str(current + 1)
    while len(page) != 6:
        page = '0' + page
    while len(nextPage) != 6:
        nextPage = '0' + nextPage
    html = 'http://www.mspaintadventures.com/?s=6&p=' + page
    site = urllib.request.urlopen(html)
    s = site.read()
    s = s.decode("utf8")
    s = s.replace("<!-- end comic content -->", "<!-- begin comic content -->")
    s = s.replace("http://cdn.mspaintadventures.com/storyfiles/hs2/", "")
    s = s.replace("?s=6&p=" + str(nextPage), str(int(nextPage)) + ".html")
    s = s.replace(page + "/" + page, page)
    a, b, c = s.split('<!-- begin comic content -->')
    b = "<title> Page " + page + "</title>" + b
    t = open(str(current) + ".html", 'w+')
    t.write(b)
    t.close()
    page = str(int(page) - 1900)
    while len(page) != 5:
        page = '0' + page
    t = open(str(current) + ".html", 'a')
    swfname = page + ".swf"
    t.write("<object width='1000' height='1000'> <param name='movie' value='" + swfname + "'>")
    t.write("<embed src=" + swfname + " width=650 height=450>")
    t.write("</embed>")
    t.write("</object>")
    t.close()
    try:
        img = "http://cdn.mspaintadventures.com/storyfiles/hs2/" + page + ".gif"
        urllib.request.urlretrieve(img, page + ".gif")
    except:
        try:
            img = "http://cdn.mspaintadventures.com/storyfiles/hs2/" + page + "_1.gif"
            urllib.request.urlretrieve(img, page + "_1.gif")
            img = "http://cdn.mspaintadventures.com/storyfiles/hs2/" + page + "_2.gif"
            urllib.request.urlretrieve(img, page + "_2.gif")
        except:
            try:
                img = "http://cdn.mspaintadventures.com/storyfiles/hs2/" + page + "/" + page + ".swf"
                urllib.request.urlretrieve(img, page + ".swf")
            except:
                print("Image " + img + " failed to download")
    print("Page " + str(page) + " of " + str(final - 1901) + " downloaded")
    current += 1
print("DONE")
1) I don't understand these lines:
t = open(str(current)+".html", 'w+')
t.write
2) You should avoid writing to the file multiple times. It's better to build the text with string formatting and then write it once:
text='''<object width='1000' height='1000'> <param name='movie' value='{0}'>
<embed src="{0}" width=650 height=450>
</embed>
</object>'''.format(swfname)
t.write(text)
t.close()
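Also, a quick sketch (not tested against the site) of how the manual zero-padding loops in the question could be replaced with str.zfill, and the outer while loop with a for loop over range; range1/range2 stand in for the user's inputs:
# Sketch only: zfill pads with leading zeros, so no while loop is needed.
range1, range2 = 1, 3
for current in range(range1 + 1900, range2 + 1900 + 1):
    page = str(current).zfill(6)              # e.g. 1901 -> "001901"
    nextPage = str(current + 1).zfill(6)
    swf_page = str(current - 1900).zfill(5)   # e.g. 1 -> "00001"
    print(page, nextPage, swf_page + ".swf")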
First off, I want to apologize to everyone who's about to read this code... I know it's a mess.
For anyone who is able to decipher it: I have a list of ~16,500 website URLs that I am scraping and then categorizing with Google's NLP. The list of URLs is created with the following chunk of code; as far as I can tell nothing is broken here.
url_list = open("/Users/my_name/Documents/Website Categorization and Scrapper Python/url_list copy", "r")
indexed_url_list = url_list.readlines()
clean_url_list = []
clean_url_list = [x[:-1] for x in indexed_url_list]
When I print the length of this list it correctly gives me the count of ~16,500.
The main block of code is as follows:
for x in clean_url_list:
    print('1')
    url = x
    print('1.1')
    try:
        r = scraper.get(url, headers = headers,)
        print('1.2')
        soup = BeautifulSoup(r.text, 'html.parser')
        print('1.3')
        title = soup.find('title').text
        print('1.4')
        description = soup.find('meta', attrs={'name': 'description'})["content"]
        print('2')
        if "content" in str(description):
            description = description.get("content")
        else:
            description = ""
        h1 = soup.find_all('h1')
        h1_all = ""
        for x in range(len(h1)):
            if x == len(h1) - 1:
                h1_all = h1_all + h1[x].text
            else:
                h1_all = h1_all + h1[x].text + ". "
        paragraphs_all = ""
        paragraphs = soup.find_all('p')
        for x in range(len(paragraphs)):
            if x == len(paragraphs) - 1:
                paragraphs_all = paragraphs_all + paragraphs[x].text
            else:
                paragraphs_all = paragraphs_all + paragraphs[x].text + ". "
        h2 = soup.find_all('h2')
        h2_all = ""
        for x in range(len(h2)):
            if x == len(h2) - 1:
                h2_all = h2_all + h2[x].text
            else:
                h2_all = h2_all + h2[x].text + ". "
        h3 = soup.find_all('h3')
        h3_all = ""
        for x in range(len(h3)):
            if x == len(h3) - 1:
                h3_all = h3_all + h3[x].text
            else:
                h3_all = h3_all + h3[x].text + ". "
        allthecontent = ""
        allthecontent = str(title) + " " + str(description) + " " + str(h1_all) + " " + str(h2_all) + " " + str(h3_all) + " " + str(paragraphs_all)
        allthecontent = str(allthecontent)[0:999]
        print(allthecontent)
    except Exception as e:
        print(e)
When I run this it successfully categorizes the first 49 URLs, but it ALWAYS stops on the 50th, no matter which URL that is. No error is thrown, and even if one were, the try/except should handle it. Using the print statements to debug, it seems not to enter the "try" section on the 50th iteration, and it's always the 50th iteration.
Any help would be much appreciated, and I hope you have some good eye wash to wipe away the code you just had to endure.
I helped look at this at work. The actual issue was a bad 50th URL whose request never returned. Adding a timeout allowed the code to escape the try/except block and move on to the next URL in a manageable fashion.
try:
    r = scraper.get(url, headers = headers, timeout=5)
except:
    continue  # handle next url in list
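For reference, here is a condensed sketch of what the per-URL loop could look like with the timeout in place and the repeated heading loops collapsed into a small join helper; it assumes scraper, headers and clean_url_list are defined as in the question:
from bs4 import BeautifulSoup

def join_texts(tags):
    # Same result as the manual index loops above: texts joined with ". "
    return ". ".join(tag.text for tag in tags)

for url in clean_url_list:
    try:
        r = scraper.get(url, headers=headers, timeout=5)  # timeout lets a hung request fail
    except Exception:
        continue  # skip this URL and move on
    soup = BeautifulSoup(r.text, "html.parser")
    title_tag = soup.find("title")
    title = title_tag.text if title_tag else ""
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content", "") if meta else ""
    allthecontent = " ".join([
        title,
        description,
        join_texts(soup.find_all("h1")),
        join_texts(soup.find_all("h2")),
        join_texts(soup.find_all("h3")),
        join_texts(soup.find_all("p")),
    ])[0:999]   # same truncation as in the question
    print(allthecontent)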
I have multiple variables (e.g. page_name, followers, following) whose values I want to append to a list, and when I write that list out each variable should go into its own column.
I want to store these in a list and then write them to Excel. How can I do this?
liked_list = []
details = []
for post in posts:
    liked_list = []
    driver.get(post)
    try:
        a_tag = wait.until(EC.visibility_of_element_located((By.XPATH, "//a[contains(@href,'/liked_by/')]")))
        liked_ = a_tag.find_elements(By.XPATH, ".//descendant::span")
        for i in liked_:
            liked_list.append(i.text)
        likes = liked_list[0]
        print(post)
        print('Likes = ' + likes)
        time.sleep(5)
        comments_1 = wait.until(EC.visibility_of_element_located((By.XPATH, "//a[contains(@href,'/comments/')]")))
        # comments_2 = comments_1.find_elements(By.XPATH, ".//descendant::span")
        comments_2 = comments_1.text
        comments_3 = comments_2.split(' ')
        print('Comments = ' + comments_3[2])
        comment_href = comments_1.get_attribute('href')
        driver.get(comment_href)
        time.sleep(5)
        post_caption = driver.find_elements(By.CLASS_NAME, "MOdxS ")
        time.sleep(5)
        comments = []
        comments_count = 1
        for i in post_caption:
            if comments_count <= 10:
                print('Comment = ' + i.text)
                comments.append(i.text)
                comments_count += 1
            else:
                break
        details.extend([page_name, followers, following, comments_3[2], comments[0]])
    except:
        print('Continue')
        continue
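One way to do this (a sketch, assuming pandas and openpyxl are installed; the column names are illustrative): build one dict per post instead of extending a single flat details list, then write the rows with DataFrame.to_excel, where each dict key becomes a column.
import pandas as pd  # assumes pandas and openpyxl are installed

rows = []
# Inside the scraping loop, append one dict per post (placeholder values shown here):
rows.append({
    "page_name": "example_page",
    "followers": 1200,
    "following": 310,
    "likes": "45",
    "top_comment": "Nice post!",
})

df = pd.DataFrame(rows)            # one column per key, one row per dict
df.to_excel("posts.xlsx", index=False)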
I am trying to scrape the data off of this post, but I am having an issue with scraping the comments. The pagination of the comments is determined by the "page=1" at the end of the URL. I noticed that if "page=0" is used, it loads all the comments on one page, which is really nice. However, my Scrapy script will only scrape the comments from the first page, no matter what; even if I change the link to "page=2" it still only scrapes the comments from the first page. I cannot figure out why this is occurring.
import scrapy
from scrapy.crawler import CrawlerProcess


class IdeaSpider(scrapy.Spider):
    name = "IdeaSpider"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.games2gether.com/amplitude-studios/endless-space-2/ideas/1850-force-infinite-actions-to"
                "-the-bottom-of-the-queue?page=0", callback=self.parse_idea)

    # parses title, post, status, author, date
    def parse_idea(self, response):
        post_author = response.xpath('//span[@class = "username-content"]/text()')
        temp_list.append(post_author.extract_first())
        post_categories = response.xpath('//a[@class = "list-tags-item ng-star-inserted"]/text()')
        post_categories_ext = post_categories.extract()
        if len(post_categories_ext) > 1:
            post_categories_combined = ""
            for category in post_categories_ext:
                post_categories_combined = post_categories_combined + category + ", "
            temp_list.append(post_categories_combined)
        else:
            temp_list.append(post_categories_ext[0])
        post_date = response.xpath('//div[@class = "time-date"]/text()')
        temp_list.append(post_date.extract_first())
        post_title = response.xpath('//h1[@class = "title"]/text()')
        temp_list.append(post_title.extract()[0])
        post_body = response.xpath('//article[@class = "post-list-item clearfix ng-star-inserted"]//div[@class = '
                                   '"post-list-item-message-content post-content ng-star-inserted"]//text()')
        post_body_ext = post_body.extract()
        if len(post_body_ext) > 1:
            post_body_combined = ""
            for text in post_body_ext:
                post_body_combined = post_body_combined + " " + text
            temp_list.append(post_body_combined)
        else:
            temp_list.append(post_body_ext[0])
        post_status = response.xpath('//p[@class = "status-title"][1]/text()')
        if len(post_status.extract()) != 0:
            temp_list.append(post_status.extract()[0])
        else:
            temp_list.append("no status")
        dev_name = response.xpath('//div[@class = "ideas-details-status-comment user-role u-bdcolor-2 dev"]//p[@class '
                                  '= "username user-role-username"]/text()')
        temp_list.append(dev_name.extract_first())
        dev_comment = response.xpath('//div[@class = "message post-content ng-star-inserted"]/p/text()')
        temp_list.append(dev_comment.extract_first())
        c_author_index = 0
        c_body_index = 0
        c_author_path = response.xpath('//article[@class = "post-list-item clearfix two-columns '
                                       'ng-star-inserted"]//span[@class = "username-content"]/text()')
        while c_author_index < len(c_author_path):
            comment_author = c_author_path[c_author_index]
            temp_list.append(comment_author.extract())
            c_author_index += 1
            c_body_combined = ""
            c_body_path = '//div[@class = "post-list-comments"]/g2g-comments-item[1]/article[@class = ' \
                          '"post-list-item clearfix two-columns ng-star-inserted"]/div/div//div[@class ' \
                          '="post-list-item-message-content post-content ng-star-inserted"]//text() '
            c_body = response.xpath(c_body_path.replace("1", str(c_body_index + 1)))
            c_body_list = c_body.extract()
            if len(c_body_list) > 1:
                for word in c_body_list:
                    c_body_combined = c_body_combined + " " + word
                temp_list.append(c_body_combined)
                c_body_index += 1
            elif len(c_body_list) != 0:
                temp_list.append(c_body_list[0])
                c_body_index += 1
            elif len(c_body_list) == 0:
                c_body_index += 1
                c_body = response.xpath(c_body_path.replace("1", str(c_body_index + 1)))
                c_body_list = c_body.extract()
                if len(c_body_list) > 1:
                    for word in c_body_list:
                        c_body_combined = c_body_combined + " " + word
                    temp_list.append(c_body_combined)
                    c_body_index += 1


temp_list = list()
all_post_data = list()

process = CrawlerProcess()
process.crawl(IdeaSpider)
process.start()
print(temp_list)
This is because the comment pages are loaded using JavaScript, and Scrapy does not render JavaScript. You could use Splash.
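A rough sketch of what the Splash route could look like, assuming a Splash instance is running on localhost:8050 and the scrapy-splash package is installed and wired into settings.py as described in its README:
# settings.py would need, among other things (see the scrapy-splash README):
# SPLASH_URL = "http://localhost:8050"

import scrapy
from scrapy_splash import SplashRequest


class IdeaSpider(scrapy.Spider):
    name = "IdeaSpider"

    def start_requests(self):
        yield SplashRequest(
            url="https://www.games2gether.com/amplitude-studios/endless-space-2/ideas/"
                "1850-force-infinite-actions-to-the-bottom-of-the-queue?page=0",
            callback=self.parse_idea,
            args={"wait": 2},  # give the JavaScript time to render the comments
        )

    def parse_idea(self, response):
        # response now contains the rendered HTML, so the XPath extraction
        # from the question can run here unchanged.
        ...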
So I am very new to Python programming and was trying to figure out a good project to get me started. I wanted to attempt searching Craigslist in multiple cities. I found a dated example online and used it as a starting point. The script below currently only has cities in Ohio, but I plan on adding all US cities. The "homecity" is currently set to Dayton. It asks for a search radius, search term, min price, and max price, and based on the latitude/longitude of the cities it only searches cities within that radius. It also searches all pages if there is more than one page of results. At the end it creates an HTML file of the results and opens it in a browser. It seems to be working fine, but I was hoping to get feedback on whether I am doing everything efficiently. I would also like to add a GUI to capture the user inputs but am not even sure where to start. Any advice there? Thanks!
#Craigslist Search
"""
Created on Thu Mar 27 11:56:54 2014
used http://cal.freeshell.org/2010/05/python-craigslist-search-script-version-2/ as
starting point.
"""
import re
import os
import os.path
import time
import urllib2
import webbrowser
from math import *
results = re.compile('<p.+</p>', re.DOTALL) #Find pattern for search results.
prices = re.compile('<span class="price".*?</span>', re.DOTALL) #Find pattern for prices.
pages = re.compile('button pagenum">.*?</span>')
delay = 10
def search_all():
    for city in list(set(searchcities)):  #add another for loop for all pages
        #Setup headers to spoof Mozilla
        dat = None
        ua = "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1.4) Gecko/20091007 Firefox/3.5.4"
        head = {'User-agent': ua}
        errorcount = 0
        #Do a quick search to see how many pages of results
        url = "http://" + city + ".craigslist.org/search/" + "sss?s=" + "0" + "&catAbb=sss&query=" + query.replace(' ', '+') + "&minAsk=" + pricemin + "&maxAsk=" + pricemax
        req = urllib2.Request(url, dat, head)
        try:
            response = urllib2.urlopen(req)
        except urllib2.HTTPError:
            if errorcount < 1:
                errorcount = 1
                print "Request failed, retrying in " + str(delay) + " seconds"
                time.sleep(int(delay))
                response = urllib2.urlopen(req)
        msg = response.read()
        errorcount = 0
        pglist = pages.findall(msg)
        pg = pglist.pop(0)
        if pg.find('of') == -1:
            pg = 100
        else:
            pg = pg[int((pg.find('of')) + 3):int((pg.find('</span>')))]
        if int(pg)/100 == 0:
            pg = 100
        numpages = range(int(pg)/100)
        for page in numpages:
            print "searching...."
            page = page*100
            url = "http://" + city + ".craigslist.org/search/" + "sss?s=" + str(page) + "&catAbb=sss&query=" + query.replace(' ', '+') + "&minAsk=" + pricemin + "&maxAsk=" + pricemax
            cityurl = "http://" + city + ".craigslist.org"
            errorcount = 0
            #Get page
            req = urllib2.Request(url, dat, head)
            try:
                response = urllib2.urlopen(req)
            except urllib2.HTTPError:
                if errorcount < 1:
                    errorcount = 1
                    print "Request failed, retrying in " + str(delay) + " seconds"
                    time.sleep(int(delay))
                    response = urllib2.urlopen(req)
            msg = response.read()
            errorcount = 0
            res = results.findall(msg)
            res = str(res)
            res = res.replace('[', '')
            res = res.replace(']', '')
            res = res.replace('<a href="', '<a href="' + cityurl)
            #res = re.sub(prices,'',res)
            res = "<BLOCKQUOTE>"*6 + res + "</BLOCKQUOTE>"*6
            outp = open("craigresults.html", "a")
            outp.write(city)
            outp.write(str(res))
            outp.close()

def calcDist(lat_A, long_A, lat_B, long_B):  #This was found at zip code database project
    distance = (sin(radians(lat_A)) *
                sin(radians(lat_B)) +
                cos(radians(lat_A)) *
                cos(radians(lat_B)) *
                cos(radians(long_A - long_B)))
    distance = (degrees(acos(distance))) * 69.09
    return distance
cities = """akroncanton:41.043955,-81.51919
ashtabula:41.871212,-80.79178
athensohio:39.322847,-82.09728
cincinnati:39.104410,-84.50774
cleveland:41.473451,-81.73580
columbus:39.990764,-83.00117
dayton:39.757758,-84.18848
limaohio:40.759451,-84.08458
mansfield:40.759156,-82.51118
sandusky:41.426460,-82.71083
toledo:41.646649,-83.54935
tuscarawas:40.397916,-81.40527
youngstown:41.086279,-80.64563
zanesville:39.9461,-82.0122
"""
if os.path.exists("craigresults.html") == True:
    os.remove("craigresults.html")

homecity = "dayton"
radius = raw_input("Search Distance from Home in Miles: ")
query = raw_input("Search Term: ")
pricemin = raw_input("Min Price: ")
pricemax = raw_input("Max Price: ")

citylist = cities.split()
#create dictionary
citdict = {}
for city in citylist:
    items = city.split(":")
    citdict[items[0]] = items[1]

homecord = str(citdict.get(homecity)).split(",")
homelat = float(homecord[0])
homelong = float(homecord[1])

searchcities = []
for key, value in citdict.items():
    distcity = key
    distcord = str(value).split(",")
    distlat = float(distcord[0])
    distlong = float(distcord[1])
    dist = calcDist(homelat, homelong, distlat, distlong)
    if dist < int(radius):
        searchcities.append(key)
print searchcities
search_all()
webbrowser.open_new('craigresults.html')
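On the GUI question: here is a rough Tkinter sketch for collecting the four inputs. It is written with Python 3's module name (tkinter); under Python 2, as in the script above, the module is Tkinter. The run_search callback is a placeholder that would hand the values to the existing search code in place of the raw_input calls.
import tkinter as tk

def run_search():
    radius = radius_entry.get()
    query = query_entry.get()
    pricemin = min_entry.get()
    pricemax = max_entry.get()
    print(radius, query, pricemin, pricemax)  # hand these to search_all() instead
    root.destroy()

root = tk.Tk()
root.title("Craigslist Search")
labels = ["Search radius (miles)", "Search term", "Min price", "Max price"]
entries = []
for row, text in enumerate(labels):
    tk.Label(root, text=text).grid(row=row, column=0, sticky="w")
    entry = tk.Entry(root)
    entry.grid(row=row, column=1)
    entries.append(entry)
radius_entry, query_entry, min_entry, max_entry = entries
tk.Button(root, text="Search", command=run_search).grid(row=4, column=1)
root.mainloop()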
I'm trying to parse the following HTML pages using BeautifulSoup (I'm going to parse a bulk of pages).
I need to save all of the fields on every page, but they can change dynamically (the field order differs between pages).
Here is an example of a page - Page 1
and a page with a different field order - Page 2
I've written the following code to parse the page.
import requests
from bs4 import BeautifulSoup

PTiD = 7680560
url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=" + str(PTiD) + ".PN.&OS=PN/" + str(PTiD) + "&RS=PN/" + str(PTiD)
res = requests.get(url, prefetch = True)
raw_html = res.content
print "Parser Started.. "
bs_html = BeautifulSoup(raw_html, "lxml")

#Initialize all the Search Lists
fonts = bs_html.find_all('font')
para = bs_html.find_all('p')
bs_text = bs_html.find_all(text=True)
onlytext = [x for x in bs_text if x != '\n' and x != ' ']

#Initialize the Indexes
AppNumIndex = onlytext.index('Appl. No.:\n')
FiledIndex = onlytext.index('Filed:\n ')
InventorsIndex = onlytext.index('Inventors: ')
AssigneeIndex = onlytext.index('Assignee:')
ClaimsIndex = onlytext.index('Claims')
DescriptionIndex = onlytext.index(' Description')
CurrentUSClassIndex = onlytext.index('Current U.S. Class:')
CurrentIntClassIndex = onlytext.index('Current International Class: ')
PrimaryExaminerIndex = onlytext.index('Primary Examiner:')
AttorneyOrAgentIndex = onlytext.index('Attorney, Agent or Firm:')
RefByIndex = onlytext.index('[Referenced By]')

#~~Title~~
for a in fonts:
    if a.has_key('size') and a['size'] == '+1':
        d_title = a.string
print "title: " + d_title

#~~Abstract~~~
d_abstract = para[0].string
print "abstract: " + d_abstract

#~~Assignee Name~~
d_assigneeName = onlytext[AssigneeIndex + 1]
print "as name: " + d_assigneeName

#~~Application number~~
d_appNum = onlytext[AppNumIndex + 1]
print "ap num: " + d_appNum

#~~Application date~~
d_appDate = onlytext[FiledIndex + 1]
print "ap date: " + d_appDate

#~~Patent Number~~
d_PatNum = onlytext[0].split(':')[1].strip()
print "patnum: " + d_PatNum

#~~Issue Date~~
d_IssueDate = onlytext[10].strip('\n')
print "issue date: " + d_IssueDate

#~~Inventors Name~~
d_InventorsName = ''
for x in range(InventorsIndex + 1, AssigneeIndex, 2):
    d_InventorsName += onlytext[x]
print "inv name: " + d_InventorsName

#~~Inventors City~~
d_InventorsCity = ''
for x in range(InventorsIndex + 2, AssigneeIndex, 2):
    d_InventorsCity += onlytext[x].split(',')[0].strip().strip('(')
d_InventorsCity = d_InventorsCity.strip(',').strip().strip(')')
print "inv city: " + d_InventorsCity

#~~Inventors State~~
d_InventorsState = ''
for x in range(InventorsIndex + 2, AssigneeIndex, 2):
    d_InventorsState += onlytext[x].split(',')[1].strip(')').strip() + ','
d_InventorsState = d_InventorsState.strip(',').strip()
print "inv state: " + d_InventorsState

#~~Assignee City~~
d_AssigneeCity = onlytext[AssigneeIndex + 2].split(',')[1].strip().strip('\n').strip(')')
print "asign city: " + d_AssigneeCity

#~~Assignee State~~
d_AssigneeState = onlytext[AssigneeIndex + 2].split(',')[0].strip('\n').strip().strip('(')
print "asign state: " + d_AssigneeState

#~~Current US Class~~
d_CurrentUSClass = ''
for x in range(CurrentUSClassIndex + 1, CurrentIntClassIndex):
    d_CurrentUSClass += onlytext[x]
print "cur us class: " + d_CurrentUSClass

#~~Current Int Class~~
d_CurrentIntlClass = onlytext[CurrentIntClassIndex + 1]
print "cur intl class: " + d_CurrentIntlClass

#~~Primary Examiner~~~
d_PrimaryExaminer = onlytext[PrimaryExaminerIndex + 1]
print "prim ex: " + d_PrimaryExaminer

#~~d_AttorneyOrAgent~~
d_AttorneyOrAgent = onlytext[AttorneyOrAgentIndex + 1]
print "agent: " + d_AttorneyOrAgent

#~~Referenced by~~
d_ReferencedBy = ''
for x in range(RefByIndex + 2, RefByIndex + 400):
    if ('Foreign' in onlytext[x]) or ('Primary' in onlytext[x]):
        break
    else:
        d_ReferencedBy += onlytext[x]
print "ref by: " + d_ReferencedBy

#~~Claims~~
d_Claims = ''
for x in range(ClaimsIndex, DescriptionIndex):
    d_Claims += onlytext[x]
print "claims: " + d_Claims
I insert all the text from the page into a list (using BeautifulSoup's find_all(text=True)). Then I try to find the indexes of the field names, walk the list from each location, and accumulate the members into a string until I reach the next field's index.
When I tried the code on several different pages, I noticed that the structure of the members changes and I can't find their indexes in the list.
For example, I search for the index of '123', and on some pages it shows up in the list as '12', '3'.
Can you think of any other way to parse the page that would be generic?
Thanks.
I think the easiest solution is to use the pyquery library:
http://packages.python.org/pyquery/api.html
You can select the elements of the page using jQuery selectors.
If you use BeautifulSoup and the DOM is <p>123</p>, find_all(text=True) gives you ['123'].
However, if the DOM is <p>12<b>3</b></p>, which has the same visible text, BeautifulSoup gives you ['12', '3'].
Maybe you could find exactly which tag keeps you from getting the complete ['123'] and ignore/eliminate that tag first.
Some rough code for eliminating the <b> tag:
import re
html='<p>12<b>3</b></p>'
reExp='<[\/\!]?b[^<>]*?>'
print re.sub(reExp,'',html)
for patterns, you could use this:
import re
patterns = '<TD align=center>(?P<VALUES_TO_FIND>.*?)<\/TD>'
print re.findall(patterns, your_html)
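Alternatively, if you stay with BeautifulSoup, calling get_text() on an element merges the text of all nested tags, so a <b> inside the value no longer splits it. A small sketch (using Python 3 print syntax):
from bs4 import BeautifulSoup

html = '<p>12<b>3</b></p>'
soup = BeautifulSoup(html, 'lxml')
# get_text() concatenates the text of every descendant of the tag,
# so the nested <b> no longer splits "123" into "12" and "3".
print(soup.p.get_text())  # -> "123"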