Python Link to File Iterator not Iterating - python

This one has had me stumped for a couple of days now, and I believe I've finally narrowed it down to this block of code. If anyone can tell me how to fix it, and why it is happening, it would be awesome.
import urllib2

GetLink = 'http://somesite.com/search?q=datadata#page'
holder = range(1, 3)
for LinkIncrement in holder:
    h = GetLink + str(LinkIncrement)
    ReadLink = urllib2.urlopen(h)
    f = open('test.txt', 'w')
    for line in ReadLink:
        f.write(line)
    f.close()
    main()  # calls function main that does stuff with the file
    continue
The problem is that it only ever writes the data from 'http://somesite.com/search?q=datadata#page'. If I do the following, the results print correctly:
for LinkIncrement in holder:
    h = GetLink + str(LinkIncrement)
    print h
The link I am copying does indeed increment in this manner, and I am able to open the URLs by copying and pasting them. Additionally, I have tried this with a while loop, but I always get the same results.
The code below opens 3 tabs with the incremented URLs /search?q=datadata#page1, /search?q=datadata#page2, and /search?q=datadata#page3. I just can't make it work in my code.
import webbrowser
import urllib2

h = ''

def tab(passed):
    url = passed
    webbrowser.open_new_tab(url + '/')

def test():
    g = 'http://somesite.com/search?q=datadata#page'
    f = urllib2.urlopen(g)
    NewVar = 1
    PageCount = 1
    while PageCount < 4:
        h = g + str(NewVar)
        PageCount += 1
        NewVar += 1
        tab(h)

test()
Thanks to Falsetru for helping me figure this out. The website was using JSON for any pages after the first.

In the URL, the part after # (the fragment identifier) is not sent to the web server; the server responds with the same content because the parts before the fragment identifier are the same.
The #something part is handled by the browser (JavaScript). You need to look at what happens in the JavaScript.
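You can check this with Python itself; the fragment stays on the client and never becomes part of the request (a quick look with the standard urlparse module):

import urlparse

parts = urlparse.urlsplit('http://somesite.com/search?q=datadata#page2')
print parts.path      # '/search'      -- sent to the server
print parts.query     # 'q=datadata'   -- sent to the server
print parts.fragment  # 'page2'        -- kept by the browser, never sent

So all three URLs are identical from the server's point of view, which is why every iteration writes the same content. Since the later pages are loaded via JSON (as you found), you need to request that JSON URL directly rather than the #pageN variant.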

Related

pool.map list index out of range python

There is about a 70% chance that it shows this error:
res=pool.map(feng,urls)
File "c:\Python27\lib\multiprocessing\pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
File "c:\Python27\lib\multiprocessing\pool.py", line 567, in get
    raise self._value
IndexError: list index out of range
I don't know why; if there are fewer than 100 items of data, there is only about a 5% chance of getting that message. Does anyone have an idea how to improve this?
#coding:utf-8
import multiprocessing
import requests
import bs4
import re
import string

root_url = 'http://www.haoshiwen.org'
#index_url = root_url+'/type.php?c=1'

def xianqin_url():
    f = 0
    h = 0
    x = 0
    y = 0
    b = []
    l = []
    for i in range(1, 64):  # number of pages
        index_url = root_url + '/type.php?c=1' + '&page=' + "%s" % i
        response = requests.get(index_url)
        soup = bs4.BeautifulSoup(response.text, "html.parser")
        x = [a.attrs.get('href') for a in soup.select('div.sons a[href^=/]')]  # extract the links inside the 'sons' divs on each page
        c = len(x)  # c links in total
        j = 0
        for j in range(c):
            url = root_url + x[j]
            us = str(url)
            print "collected %s" % us
            l.append(url)  # pool = multiprocessing.Pool(8)
    return l

def feng(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    #print response.text
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    #content = soup.select('div.shileft')
    qq = str(soup)
    soupout = re.findall(r"原文(.+?)</div>", qq, re.S)  # the section starting with "原文" (original text) and ending with </div>
    #print soupout[1]
    content = str(soupout[1])
    b = "风"
    cc = content.count(b, 0, len(content))
    return cc

def start_process():
    print 'Starting', multiprocessing.current_process().name

def feng(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    #print response.text
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    #content = soup.select('div.shileft')
    qq = str(soup)
    soupout = re.findall(r"原文(.+?)</div>", qq, re.S)  # the section starting with "原文" (original text) and ending with </div>
    #print soupout[1]
    content = str(soupout[1])
    b = "风"
    c = "花"
    d = "雪"
    e = "月"
    f = content.count(b, 0, len(content))
    h = content.count(c, 0, len(content))
    x = content.count(d, 0, len(content))
    y = content.count(e, 0, len(content))
    return f, h, x, y

def find(urls):
    r = [0, 0, 0, 0]
    pool = multiprocessing.Pool()
    res = pool.map(feng, urls)
    for i in range(len(res)):
        r = map(lambda (a, b): a + b, zip(r, res[i]))
    return r

if __name__ == "__main__":
    print "Starting to collect URLs"
    qurls = xianqin_url()
    print "Collected %s links" % len(qurls)
    print "Starting to match pre-Qin poems"
    find(qurls)
    print '''
Among %s pre-Qin texts:
---------------------------
风 (wind):   %s
花 (flower): %s
雪 (snow):   %s
月 (moon):   %s
Data source: %s
''' % (len(qurls), find(qurls)[0], find(qurls)[1], find(qurls)[2], find(qurls)[3], root_url)
Note: Stack Overflow's filter would not let me post the body with that pool.map call written out, so in the original post I wrote that line as res = pool.map4(feng, urls); the actual code uses pool.map, as shown above.
I'm trying to extract some substrings from this website using multiprocessing.
Indeed, multiprocessing makes this a bit hard to debug, since you don't see where the index-out-of-range error occurred (the error message makes it look as if it happened inside the multiprocessing module itself).
In some cases this line:

content = str(soupout[1])

raises an index-out-of-range error, because soupout is an empty list. If you change it to

if len(soupout) == 0:
    return None

and then drop the None values that were returned by changing

res = pool.map(feng, urls)

into

res = pool.map(feng, urls)
res = [r for r in res if r is not None]

then you can avoid the error. That said, you probably want to find out the root cause of why re.findall returned an empty list. It is certainly a better idea to select the node with BeautifulSoup than with a regex, as matching with bs4 is generally more stable, especially if the website slightly changes its markup (e.g. whitespace, etc.).
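A minimal sketch of what that could look like with bs4 instead of the regex; note that the div.shileft selector is only taken from the commented-out line in your code and may not match the site's real markup:

# -*- coding: utf-8 -*-
import bs4
import requests

def count_chars(url):
    """Select the poem body with BeautifulSoup and count the four characters."""
    response = requests.get(url)
    response.encoding = 'utf-8'
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    node = soup.select_one("div.shileft")  # assumed selector; adjust to the real markup
    if node is None:
        return None                        # markup changed or page incomplete
    text = node.get_text()
    return tuple(text.count(ch) for ch in (u"风", u"花", u"雪", u"月"))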
Update:
Why is soupout an empty list? When I didn't use pool.map, I never had this error message shown.
This is probably because you hammer the web server too fast. In a comment you mention that you sometimes get 504 in response.status_code. 504 means Gateway Time-out: the server was acting as a gateway or proxy and did not receive a timely response from the upstream server.
This is because haoshiwen.org appears to be powered by kangle, which is a reverse proxy. The reverse proxy forwards all the requests you send it to the web server behind it, and if you start too many processes at once the poor web server cannot handle the flood. Kangle has a default timeout of 60 s, so as soon as it does not get an answer back from the web server within 60 s it shows the error you posted.
How do you fix that?
You could limit the number of processes: pool = multiprocessing.Pool(2). You would need to experiment to find a good number of processes.
At the top of feng(url) you could add a time.sleep(5) so each process waits 5 seconds between requests. Here, too, you would need to experiment with the sleep time.
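Putting both suggestions together, a rough sketch (the worker count, delay and timeout are all values you would need to tune, and the body of fetch() is only a placeholder for your real parsing):

import time
import multiprocessing

import requests

def fetch(url):
    """Fetch one page politely: pause first, then request with a timeout."""
    time.sleep(5)                       # pause between requests (tune this)
    response = requests.get(url, timeout=30)
    response.encoding = 'utf-8'
    return len(response.text)           # placeholder for your real parsing

def fetch_all(urls):
    pool = multiprocessing.Pool(2)      # only two worker processes (tune this)
    try:
        return pool.map(fetch, urls)
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":
    print fetch_all(["http://www.haoshiwen.org/type.php?c=1&page=1"])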

urllib.urlopen() on variables that are assigned string values won't work?

I'm writing some code to parse an XML file. I'm just wondering if someone could explain why this isn't working. If I put the variable link itself into urllib.urlopen(), it does not seem to make it to that URL. However, when I put "http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?max-results=50&time=today" directly inside urllib.urlopen(), it works. Does it need to be a string literal and not a variable, or is there a way around it?
import urllib
from bs4 import BeautifulSoup

class Uel(object):
    def __init__(self, link):
        self.content_data = []
        self.num_likes = []
        self.num_dislikes = []
        self.favoritecount = []
        self.view_count = []
        self.link = link
        self.web_obj = urllib.urlopen(link)
        self.file = open('youtubequery.txt', 'w+')
        self.file.write(str(self.web_obj))
        for i in self.web_obj:
            self.file.write(i)
        with open("youtubequery.txt", "r") as myfile:
            self.file_2 = myfile.read()
        self.soup = BeautifulSoup(self.file_2)
        for link in self.soup.find_all("content"):
            self.content_data.append(str(link.get("src")))
        for stat in self.soup.find_all("yt:statistics"):
            self.favoritecount.append(str(stat.get("favoritecount")))
        for views in self.soup.find_all("yt:statistics"):
            self.view_count.append(str(views.get("viewcount")))
        for numlikes in self.soup.find_all("yt:rating"):
            self.num_likes.append(str(numlikes.get("numlikes")))
        for numdislikes in self.soup.find_all("yt:rating"):
            self.num_dislikes.append(str(numdislikes.get("numdislikes")))

    def __str__(self):
        return str(self.content_data), str(self.num_likes), str(self.num_dislikes)

link = "http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?max-results=50&time=5"
data = Uel(link)
print data.__str__()
In the code you've presented, you are using this URL:

http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?max-results=50&time=5

a request to which produces:

Invalid value for time parameter: 5

But in the question itself you've mentioned the following URL:

http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?max-results=50&time=today

which has time=today. The code with this URL works for me.
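It has nothing to do with using a variable instead of a literal; urlopen() only ever sees the string you pass it. Building the query string programmatically makes mistakes like this easier to spot — a small sketch using the same parameters as in the question:

import urllib

base = "http://gdata.youtube.com/feeds/api/standardfeeds/top_rated"
params = urllib.urlencode({"max-results": 50, "time": "today"})
link = base + "?" + params
print link                      # the exact URL that will be requested
web_obj = urllib.urlopen(link)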

Same code used in multiple functions but with minor differences - how to optimize?

This is code from a Udacity course, and I changed it a little. Now, when it runs, it asks me for a movie name and the trailer opens in a pop-up in a browser (that is another part, which is not shown here).
As you can see, this program has a lot of repetitive code in it; the functions extract_name, movie_poster_url and movie_trailer_url contain roughly the same code. Is there a way to get rid of the repeated code but keep the same output? If so, will it run faster?
import fresh_tomatoes
import media
import urllib
import requests
from BeautifulSoup import BeautifulSoup

name = raw_input("Enter movie name:- ")
global movie_name

def extract_html(name):
    url = "website name" + name + "continuation of website name" + name + "again continuation of web site name"
    response = requests.get(url)
    page = str(BeautifulSoup(response.content))
    return page

def extract_name(page):
    start_link = page.find(' - IMDb</a></h3><div class="s"><div class="kv"')
    start_url = page.find('>', start_link-140)
    start_url1 = page.find('>', start_link-140)
    end_url = page.find(' - IMDb</a>', start_link-140)
    name_of_movie = page[start_url1+1:end_url]
    return extract_char(name_of_movie)

def extract_char(name_of_movie):
    name_array = []
    for words in name_of_movie:
        word = words.strip('</b>,')
        name_array.append(word)
    return ''.join(name_array)

def movie_poster_url(name_of_movie):
    movie_name, seperator, tail = name_of_movie.partition(' (')
    #movie_name = name_of_movie.rstrip('()0123456789 ')
    page = urllib.urlopen('another web site name' + movie_name + 'continuation of website name').read()
    start_link = page.find('"Poster":')
    start_url = page.find('"', start_link+9)
    end_url = page.find('"', start_url+1)
    poster_url = page[start_url+1:end_url]
    return poster_url

def movie_trailer_url(name_of_movie):
    movie_name, seperator, tail = name_of_movie.partition(' (')
    #movie_name = name_of_movie.rstrip('()0123456789 ')
    page = urllib.urlopen('another website name' + movie_name + " trailer").read()
    start_link = page.find('<div class="yt-lockup-dismissable"><div class="yt-lockup-thumbnail contains-addto"><a aria-hidden="true" href=')
    start_url = page.find('"', start_link+110)
    end_url = page.find('" ', start_url+1)
    trailer_url1 = page[start_url+1:end_url]
    trailer_url = "www.youtube.com" + trailer_url1
    return trailer_url

page = extract_html(name)
movie_name = extract_name(page)
new_movie = media.Movie(movie_name, "Storyline WOW", movie_poster_url(movie_name), movie_trailer_url(movie_name))
movies = [new_movie]
fresh_tomatoes.open_movies_page(movies)
You could move the shared parts into their own function:

def find_page(url, name, find, offset):
    movie_name, separator, tail = name.partition(' (')
    page = urllib.urlopen(url.format(movie_name)).read()
    start_link = page.find(find)
    start_url = page.find('"', start_link + offset)
    end_url = page.find('" ', start_url + 1)
    return page[start_url+1:end_url]

def movie_poster_url(name_of_movie):
    return find_page("another website name{} continuation of website name", name_of_movie, '"Poster":', 9)

def movie_trailer_url(name_of_movie):
    trailer_url = find_page("another website name{} trailer", name_of_movie, '<div class="yt-lockup-dismissable"><div class="yt-lockup-thumbnail contains-addto"><a aria-hidden="true" href=', 110)
    return "www.youtube.com" + trailer_url
It definitely won't run faster (there is a little extra work to "switch" between the functions), but the performance difference is probably negligible.
For your second question: profiling is not a technique or method for speeding things up by itself; it is how you find out what is slow in your code:

Profiling is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. (Wikipedia)

So it's not something that speeds up your program; it's a word for the things you do to find out what you can do to speed up your program.
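For example, a minimal profiling run with the standard library looks like this (work() is just a stand-in for your own code):

import cProfile
import pstats

def work():
    return sum(i * i for i in range(10 ** 6))

cProfile.run("work()", "stats.out")                                  # run and record timings
pstats.Stats("stats.out").sort_stats("cumulative").print_stats(10)   # show the 10 most expensive calls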
Going really quickly here because I am a super newb, but I can see the repetition. What I would do is figure out the (mostly) repeating blocks of code shared by all 3 functions, then figure out where they differ, and write a new function that takes the differences as arguments. So, for instance:

def extract(tarString, delim, startDiff, endDiff):
    # uses the page text fetched earlier (a global in the question's script)
    start_link = page.find(tarString)
    start_url = page.find(delim, start_link + startDiff)
    end_url = page.find(delim, start_url + endDiff)
    url_out = page[start_url+1:end_url]
    return url_out

Then, in your poster, trailer, etc. functions, just call this extract function with the appropriate arguments for each case. For instance, the poster would call

poster_url = extract(tarString='"Poster":', delim='"', startDiff=9, endDiff=1)
I can see you've got another answer already and it's very likely it's written by someone who knows more than I do, but I hope you get something out of my "philosophy of modularizing" from a newbie perspective.

Python refresh file from disk

I have a python script that calls a system program and reads the output from a file out.txt, acts on that output, and loops. However, it doesn't work, and a close investigation showed that the python script just opens out.txt once and then keeps on reading from that old copy. How can I make the python script reread the file on each iteration? I saw a similar question here on SO but it was about a python script running alongside a program, not calling it, and the solution doesn't work. I tried closing the file before looping back but it didn't do anything.
EDIT:
I already tried closing and opening, it didn't work. Here's the code:
import subprocess, os, sys

filename = sys.argv[1]
file = open(filename, 'r')
foo = open('foo', 'w')
foo.write(file.read().rstrip())
foo = open('foo', 'a')
crap = open(os.devnull, 'wb')
numSolutions = 0
while True:
    subprocess.call(["minisat", "foo", "out"], stdout=crap, stderr=crap)
    out = open('out', 'r')
    if out.readline().rstrip() == "SAT":
        numSolutions += 1
        clause = out.readline().rstrip()
        clause = clause.split(" ")
        print clause
        clause = map(int, clause)
        clause = map(lambda x: -x, clause)
        output = ' '.join(map(lambda x: str(x), clause))
        print output
        foo.write('\n' + output)
        out.close()
    else:
        break
print "There are ", numSolutions, " solutions."
You need to flush foo so that the external program can see its latest changes. When you write to a file, the data is buffered in the local process and sent to the system in larger blocks. This is done because updating the system file is relatively expensive. In your case, you need to force a flush of the data so that minisat can see it.
foo.write('\n'+output)
foo.flush()
I rewrote it to hopefully be a bit easier to understand:
import os
from shutil import copyfile
import subprocess
import sys

TEMP_CNF = "tmp.in"
TEMP_SOL = "tmp.out"
NULL = open(os.devnull, "wb")

def all_solutions(cnf_fname):
    """
    Given a file containing a set of constraints,
    generate all possible solutions.
    """
    # make a copy of original input file
    copyfile(cnf_fname, TEMP_CNF)
    while True:
        # run minisat to solve the constraint problem
        subprocess.call(["minisat", TEMP_CNF, TEMP_SOL], stdout=NULL, stderr=NULL)
        # look at the result
        with open(TEMP_SOL) as result:
            line = next(result)
            if line.startswith("SAT"):
                # Success - return solution
                line = next(result)
                solution = [int(i) for i in line.split()]
                yield solution
            else:
                # Failure - no more solutions possible
                break
        # disqualify found solution
        with open(TEMP_CNF, "a") as constraints:
            new_constraint = " ".join(str(-i) for i in solution)
            constraints.write("\n")
            constraints.write(new_constraint)

def main(cnf_fname):
    """
    Given a file containing a set of constraints,
    count the possible solutions.
    """
    count = sum(1 for i in all_solutions(cnf_fname))
    print("There are {} solutions.".format(count))

if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("Usage: {} cnf.in".format(sys.argv[0]))
Open the file inside the loop, and close it at the end of each iteration with file_var.close():

for ...:
    ga_file = open('out.txt', 'r')
    # ... do stuff
    ga_file.close()
Demo of an implementation below (as simple as possible, this is all of the Jython code needed)...
__author__ = ''

import time

var = 'false'
while var == 'false':
    out = open('out.txt', 'r')
    content = out.read()
    time.sleep(3)
    print content
    out.close()
generates this output:
2015-01-09, 'stuff added'
2015-01-09, 'stuff added' # <-- this is when i just saved my update
2015-01-10, 'stuff added again :)' # <-- my new output from file reads
I strongly recommend reading the error messages. They hold quite a lot of information.
I think the full file name should be printed for debugging purposes.
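For example, printing the absolute path (and the modification time) of the file being read makes it obvious which copy the script is actually looking at:

import os

path = os.path.abspath("out.txt")
print path                      # the full path the script is really reading
print os.path.getmtime(path)    # last-modification time, to confirm it changed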

How to insert a "missing" page as blank page in PDF with Python?

Say you have to join some pages numbered 2, 4 and 5 (the files are named test_002.pdf, test_004.pdf and test_005.pdf); then we could say page 3 is missing.
What I am trying to do is get the result of these commands:

pdfjam --nup 2 --papersize '{47cm,30cm}' --scale 1.0 test_002.pdf test_003.pdf --outfile joined_002-003.pdf
pdfjam --nup 2 --papersize '{47cm,30cm}' --scale 1.0 test_004.pdf test_005.pdf --outfile joined_004-005.pdf

which join each even/odd pair onto one single page, with a blank page (3) in place of the missing page.
I guess it should:
check the incoming files from beginning to end, looking for which page is missing (in this case, from 2 to 5, #3 is missing)
generate blank '23.5cm,30cm' PDF pages on the fly (using pyPdf maybe; see the sketch below)
pair them up as 'even' and 'odd' so that every even page can be joined with its odd page (using pdfjam)…
Am I right?
Is that possible with a few lines of Python?
Or is there an easier way?
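For the blank-page step, this untested sketch is the kind of thing I have in mind — assuming pyPdf's PdfFileWriter.addBlankPage accepts a width and height in points, as PyPDF2's does:

# -*- coding: UTF8 -*-
from pyPdf import PdfFileWriter

CM_TO_PT = 72 / 2.54                     # 1 point = 1/72 inch

writer = PdfFileWriter()
writer.addBlankPage(width=23.5 * CM_TO_PT, height=30 * CM_TO_PT)
with open("test_003.pdf", "wb") as blank:
    writer.write(blank)                  # blank page standing in for the missing page 3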
Here's what I started to do, trying to make it work like a hot folder, but I'm really completely lost in the even/odd management and the missing files/pages:
#!/usr/bin/python
# -*- coding: UTF8 -*-

import os
import os.path
import re
import time
import datetime

CODEFILE = re.compile("^(TES|EXA).*\.pdf$")
WHERE = "/tmp/TEST/"
STORAGE = "/tmp/WORK/"
DBLSIZE = "{47cm,30cm}"

def time_stamp():
    now = datetime.datetime.now()
    return now.strftime("%Y-%m-%d %H:%M:%S")

print(time_stamp()+" : Starting.")

def files_list(path):
    this_files = list()
    root, dires, files = os.walk(path).next()
    for f in files:
        if CODEFILE.match(f):
            this_files.append(os.path.join(root, f))
    return this_files

def file_sizes(filename):
    meta = os.lstat(filename)
    return meta.st_size

def files_to_handle(path):
    this_files = list()
    ft1 = dict()
    ft2 = dict()
    for f in files_list(WHERE):
        ft1[f] = file_sizes(f)
    time.sleep(10)
    for f in files_list(WHERE):
        ft2[f] = file_sizes(f)
    for f, t in ft2.items():
        try:
            if ft1[f] == t:
                this_files.append(f)
        except:
            pass
    return this_files

r = files_to_handle(WHERE)
print(time_stamp()+" : Files available :")
print(r)
for f in r:
    rc = os.system("pdfjam --batch --nup 2 --papersize {0} --scale 1.0 --outfile . {1}".format(
        DBLSIZE, f))
    if rc != 0:
        print(time_stamp()+" : an ERROR has occurred with the file {0}.".format(f))
    else:
        print(time_stamp()+" : files {0} OK.".format(f))
        os.system("mv {0} {1}".format(f, STORAGE))
print(time_stamp()+" : Stopping.")
Thanks in advance!
