EDIT: (SOLVED) When I read the values in from my file, a newline character (\n) was being appended to the end of each one, and it split my request string at that point.
I think it's down to how I saved the values to the file in the first place. Many thanks.
I have the following code:
results = 'http://www.myurl.com/'+str(mystring)
print str(results)
request = urllib2.Request(results)
request.add_header('User-Agent','Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)')
opener = urllib2.build_opener()
text = opener.open(request).read()
This is inside a loop. After the loop has run a few times, str(mystring) changes to give a different set of results.
I can loop the script as many times as I like while keeping the value of str(mystring) constant, but every time I change the value of str(mystring) I get an error saying "no host given" when the code tries to build the opener:
opener = urllib2.build_opener()
Can anyone help please?
TIA,
Paul.
EDIT:
More code here.....
import sys
import string
import httplib
import urllib2
import re
import random
import time
def StripTags(text):
    finished = 0
    while not finished:
        finished = 1
        start = text.find("<")
        if start >= 0:
            stop = text[start:].find(">")
            if stop >= 0:
                text = text[:start] + text[start+stop+1:]
                finished = 0
    return text
mystring="test"
d={}
with open("myfile","r") as f:
while True:
page_counter=0
print str(mystring)
try:
while page_counter <20:
results = 'http://www.myurl.com/'+str(mystring)
print str(results)
request = urllib2.Request(results)
request.add_header('User-Agent','Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)')
opener = urllib2.build_opener()
text = opener.open(request).read()
finds = (re.findall('([\w\.\-]+'+mystring+')',StripTags(text)))
for find in finds:
d[find]=1
uniq_emails=d.keys()
page_counter = page_counter +1
print "found this " +str(finds)"
random.seed()
n = random.random()
i = n * 5
print "Pausing script for " + str(i) + " Seconds" + ""
time.sleep(i)
mystring=next(f)
except IOError:
print "No result found!"+""
I found the answer. It's as follows....
The values for mystring were read in from a file.
In the script I wrote to create the file, I opened it with "w" instead of "wb".
Each line in the file ended with a newline character "\n".
When mystring was appended to the request string, that newline ended up in the middle of the request string.[1]
This would never have been apparent from my code because I changed it to post here in an effort to hide the real url I am using to get my results.[2]
My actual url looks more like this.....
Myurl.com/mystring/otherstuff/page_counter/morestuff.htm
The \n being read from the file split my url and gave urllib problems......
[1] I use Windows. It adds lots of unseen things to text files. If I'd opened the file to write to with "wb" instead of "w" the contents would have been written without the unseen \n.
[2] always post your full code kids. The good people of stackoverflow can't help you unless they can see what you are doing.....
Many thanks all, I hope this helps someone out at some point.
Paul.
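For reference, a minimal sketch of the fix (the filename and loop shape here are illustrative, not my exact script): strip the trailing newline from each line as it is read, so the URL is built from a clean value.

with open("myfile", "r") as f:
    for line in f:
        mystring = line.rstrip("\r\n")   # drop the trailing newline (and any \r Windows left behind)
        results = 'http://www.myurl.com/' + mystring
        request = urllib2.Request(results)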
In the while loop, you're setting results to something which is not a url:
results = 'myurl+str(mystring)'
It should probably be results = myurl+str(mystring)
By the way, it appears there's no need for all the casting to string (str()) you do:
(expanded on request)
print str(foo): in such a case, str() is never necessary. Python will always print foo's string representation
results = 'http://www.myurl.com/'+str(mystring). This is also unnecessary; mystring is already a string, so 'http://www.myurl.com/' + mystring would suffice.
print "Pausing script for " + str(i) + " Seconds". Here you would get an error without str() since you can't do string + int. However, print "foo", 1, "bar" does work. As do print "foo %i bar" % 1 and print "foo {0} bar".format(1) (see here)
Related
I need to split lines into variables.
Here is an example of 2 lines:
port11.annex1.naples.net [30:00:00:03] "GET /logos/small_gopher.gif HTTP/1.0" 200 935
port11.annex1.naples.net [30:00:00:03] "GET /icons/book.gif" 200 935
However, as you can see sometimes a line is missing one piece.
How can I split this without errors?
Currently I am using:
for x in log.readlines():
    data = x.split(" ")
    hostname = data[0]
    time = data[1]
    command = data[2]
    resource = data[3]
    version = data[4]
    status = data[5]
    size = data[6]
This gives errors, because not every line has 7 "items".
Maybe I should split on multiple delimiters, but I can't find a good way to do that...
Why are you not doing it like this? Suppose your log string is this one:
log = r'port11.annex1.naples.net [30:00:00:03] "GET /icons/book.gif" 200 935'
data = log.split(" ")
for i in data:
    print i
This way you won't have to give the indexes and can avoid hard-coding them.
You could use a regex to match the different components of the log. Then you'll need to check whether the request part consists of command, resource and version, or only command and resource. Something like this could work:
import re
# open your log file here...
logmatcher = re.compile("([^ ]+) (\[[:0-9]+\]) (\"[^\"]+\") ([0-9]+) ([0-9]+)")
for x in log.readlines():
    res = logmatcher.findall(x)
    if len(res) > 0:
        hostname = res[0][0]
        time = res[0][1]
        req = res[0][2][1:-1].split(" ")  #[1:-1] to get rid of the ""
        if len(req) > 2:  # check if request contains the http version
            command = req[0]
            resource = req[1]
            version = req[2]
        else:
            command = req[0]
            resource = req[1]
            version = ""  # there's no version in the request. just use ""
        status = res[0][3]
        size = res[0][4]
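For example, feeding the question's two sample lines through that matcher (a small self-contained sketch) would print the pieces like this:

lines = [
    'port11.annex1.naples.net [30:00:00:03] "GET /logos/small_gopher.gif HTTP/1.0" 200 935',
    'port11.annex1.naples.net [30:00:00:03] "GET /icons/book.gif" 200 935',
]
for x in lines:
    res = logmatcher.findall(x)
    if res:
        req = res[0][2][1:-1].split(" ")
        version = req[2] if len(req) > 2 else ""
        print res[0][0], res[0][1], req[0], req[1], version, res[0][3], res[0][4]
# port11.annex1.naples.net [30:00:00:03] GET /logos/small_gopher.gif HTTP/1.0 200 935
# port11.annex1.naples.net [30:00:00:03] GET /icons/book.gif  200 935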
There is about a 70% chance that this shows an error:
    res=pool.map(feng,urls)
  File "c:\Python27\lib\multiprocessing\pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "c:\Python27\lib\multiprocessing\pool.py", line 567, in get
    raise self._value
IndexError: list index out of range
I don't know why; if there are fewer than about 100 items of data, there is only around a 5% chance of seeing that message. Does anyone have an idea how to improve this?
#coding:utf-8
import multiprocessing
import requests
import bs4
import re
import string

root_url = 'http://www.haoshiwen.org'
#index_url = root_url+'/type.php?c=1'

def xianqin_url():
    f = 0
    h = 0
    x = 0
    y = 0
    b = []
    l = []
    for i in range(1,64):  # number of index pages
        index_url = root_url+'/type.php?c=1'+'&page='+"%s" % i
        response = requests.get(index_url)
        soup = bs4.BeautifulSoup(response.text,"html.parser")
        x = [a.attrs.get('href') for a in soup.select('div.sons a[href^=/]')]  # collect the links inside div.sons on each page
        c = len(x)  # c links in total
        j = 0
        for j in range(c):
            url = root_url+x[j]
            us = str(url)
            print "collected %s" % us
            l.append(url)  #pool = multiprocessing.Pool(8)
    return l

def feng(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    #print response.text
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    #content = soup.select('div.shileft')
    qq = str(soup)
    soupout = re.findall(r"原文(.+?)</div>",qq,re.S)  # the block starting with "原文" ("original text") and ending with </div>
    #print soupout[1]
    content = str(soupout[1])
    b = "风"  # "wind"
    cc = content.count(b,0,len(content))
    return cc

def start_process():
    print 'Starting', multiprocessing.current_process().name

def feng(url):  # note: this second definition replaces the one above
    response = requests.get(url)
    response.encoding = 'utf-8'
    #print response.text
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    #content = soup.select('div.shileft')
    qq = str(soup)
    soupout = re.findall(r"原文(.+?)</div>",qq,re.S)  # the block starting with "原文" ("original text") and ending with </div>
    #print soupout[1]
    content = str(soupout[1])
    b = "风"  # "wind"
    c = "花"  # "flower"
    d = "雪"  # "snow"
    e = "月"  # "moon"
    f = content.count(b,0,len(content))
    h = content.count(c,0,len(content))
    x = content.count(d,0,len(content))
    y = content.count(e,0,len(content))
    return f,h,x,y

def find(urls):
    r = [0,0,0,0]
    pool = multiprocessing.Pool()
    res = pool.map(feng,urls)
    for i in range(len(res)):
        r = map(lambda (a,b): a+b, zip(r,res[i]))
    return r

if __name__=="__main__":
    print "start collecting URLs"
    qurls = xianqin_url()
    print "collected %s links" % len(qurls)
    print "start matching pre-Qin poems"
    find(qurls)
    print '''
Across %s pre-Qin texts:
---------------------------
wind (风): %s
flower (花): %s
snow (雪): %s
moon (月): %s
data source: %s
''' % (len(qurls),find(qurls)[0],find(qurls)[1],find(qurls)[2],find(qurls)[3],root_url)
(Note: Stack Overflow would not accept my post body with that line written as pool.map, so I had posted it as res=pool.map4(feng,urls); the real code uses res=pool.map(feng,urls), as the traceback shows.)
I'm trying to extract some substrings from this website using multiprocessing.
Indeed, multiprocessing makes it a bit harder to debug, as you don't see where the IndexError occurred (the error message makes it appear as if it happened inside the multiprocessing module itself).
In some cases this line:
content = str(soupout[1])
raises an IndexError, because soupout is an empty list. If you guard it with
if len(soupout) == 0:
    return None
and then filter out the None results by changing
res = pool.map(feng,urls)
into
res = pool.map(feng,urls)
res = [r for r in res if r is not None]
then you can avoid the error. That said, you probably want to find the root cause of why re.findall returned an empty list. It is certainly a better idea to select the node with BeautifulSoup than with a regex, as matching with bs4 is generally more robust, especially if the website slightly changes its markup (e.g. whitespace); a rough sketch of that approach follows.
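A rough sketch of that bs4-based selection, assuming the poem text really sits inside the div.shileft container that the question's commented-out line hints at (the exact selector would need to be checked against the page):

# -*- coding: utf-8 -*-
import requests
import bs4

def feng(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    nodes = soup.select('div.shileft')   # selector borrowed from the question's commented-out line; verify it
    if not nodes:                        # nothing matched: skip this page instead of raising IndexError
        return None
    content = nodes[0].get_text()        # plain text of the node, no markup
    return tuple(content.count(ch) for ch in (u"风", u"花", u"雪", u"月"))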
Update:
why is soupout an empty list? When I don't use pool.map I never have this error message shown
This is probably because you hammer the web server too fast. In a comment you mention that you sometimes get 504 as response.status_code. 504 means Gateway Time-out: the server was acting as a gateway or proxy and did not receive a timely response from the upstream server.
haoshiwen.org seems to be powered by kangle, which is a reverse proxy. The reverse proxy forwards every request you send to the web server behind it, and if you start too many processes at once the poor web server cannot handle the flood. Kangle has a default timeout of 60s, so as soon as it gets no answer back from the web server within 60s it shows the error you posted.
How do you fix that?
You could limit the number of processes: pool = multiprocessing.Pool(2); you'd need to experiment to find a good number of processes.
At the top of feng(url) you could add a time.sleep(5) so each process waits 5 seconds between requests; here too you'd need to experiment with the sleep time. A combined sketch of both suggestions follows.
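A minimal combined sketch (the pool size and sleep time are guesses to be tuned, and the guard is widened to cover the soupout[1] index used in the question):

# -*- coding: utf-8 -*-
import time
import multiprocessing
import requests
import bs4
import re

def feng(url):
    time.sleep(5)  # throttle: pause before each request (tune this value)
    response = requests.get(url)
    response.encoding = 'utf-8'
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    soupout = re.findall(r"原文(.+?)</div>", str(soup), re.S)
    if len(soupout) < 2:  # the original code indexes soupout[1], so guard against short results
        return None
    content = str(soupout[1])
    return tuple(content.count(ch) for ch in ("风", "花", "雪", "月"))

def find(urls):
    r = [0, 0, 0, 0]
    pool = multiprocessing.Pool(2)  # limit concurrency (tune this value)
    res = pool.map(feng, urls)
    res = [x for x in res if x is not None]  # drop pages where nothing matched
    for counts in res:
        r = [a + b for a, b in zip(r, counts)]
    return r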
I've recently been trying to scrape a site that contains chemistry exam tests in PDF form using Python. I used requests and everything was going well, until some of the downloads were cut short at a very small size, i.e. 2KB. What's curious, though, is that it happens completely at random: with every run of the script the files that get cut short are different. I've been scratching my head for a while now and decided to ask here. Downloading them manually would probably have been faster by now, but I want to know why the script isn't working, for future reference.
I wrote the script to be asynchronous, so it occurred to me that I could have been DoSing the server. However, I've replaced every Pool with a synchronous for loop, even adding time.sleep() here and there, and it didn't help. With that approach none of the files were fully downloaded; practically every single one stopped at 2KB.
Please forgive me if the question is naive or my mistake is foolish as I am only a hobby programmer. I'll be grateful for any help.
P.S. I've intercepted the headers using Postman in Chrome; without them the response was 500. However, I won't include them, as they contain session IDs that would let you log into my account.
The script is as follows:
from shutil import copyfileobj
from multiprocessing.dummy import Pool as ThreadPool
from requests import get
from time import sleep

titles = {
    "95": "Budowa atomu. Układ okresowy pierwiastków chemicznych",
    "96": "Wiązania chemiczne",
    "97": "Systematyka związków nieorganicznych",
    "98": "Stechiometria",
    "99": "Reakcje utleniania-redukcji. Elektrochemia",
    "100": "Roztwory",
    "101": "Kinetyka chemiczna",
    "102": "Reakcje w wodnych roztworach elektrolitów",
    "103": "Charakterystyka pierwiastków i związków chemicznych",
    "104": "Chemia organiczna jako chemia związków węgla",
    "105": "Węglowodory",
    "106": "Jednofunkcyjne pochodne węglowodorów",
    "107": "Wielofunkcyjne pochodne węglowodorów",
    "108": "Arkusz maturalny"
}
#collection = {"120235": "Chemia nieorganiczna", "120586": "Chemia organiczna"}

url = "https://e-testy.terazmatura.pl/print/%s/quiz_%s/%s"
# headers = {...}  (omitted on purpose: they contain my session IDs)

def downloadTest(id):
    with ThreadPool(2) as tp:
        tp.starmap(downloadActualTest, [(id, "blank"), (id, "key")])

def downloadActualTest(id, dataType):
    name = titles[str(id)]
    if id in range(95, 104):
        collectionId = 120235
    else:
        collectionId = 120586
    if dataType == "blank":
        with open("Pulled Data/%s - pusty.pdf" % name, "wb") as test:
            print("Downloading: " + url % (collectionId, id, "blank") + '\n')
            r = get(url % (collectionId, id, "blank"),
                    stream=True,
                    headers=headers)
            r.raw.decode_content = True
            copyfileobj(r.raw, test)
    elif dataType == "key":
        with open("Pulled Data/%s - klucz.pdf" % name, "wb") as test:
            print("Downloading: " + url % (collectionId, id, "key") + '\n')
            r = get(url % (collectionId, id, "key"),
                    stream=True,
                    headers=headers)
            r.raw.decode_content = True
            copyfileobj(r.raw, test)

with ThreadPool(3) as p:
    p.map(downloadTest, range(95, 109))
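One way to narrow this down (a diagnostic sketch reusing the question's url, titles and headers) is to log what the server actually returns for the truncated files and to let requests buffer the whole body instead of copying the raw stream:

def downloadActualTest(id, dataType):
    name = titles[str(id)]
    collectionId = 120235 if id in range(95, 104) else 120586
    r = get(url % (collectionId, id, dataType), headers=headers)  # no stream: requests reads the full body
    # Log what came back before trusting it: a 2KB "PDF" is often an HTML error page in disguise.
    print(id, dataType, r.status_code, r.headers.get("Content-Type"), len(r.content))
    suffix = "pusty" if dataType == "blank" else "klucz"
    if r.ok and r.content.startswith(b"%PDF"):
        with open("Pulled Data/%s - %s.pdf" % (name, suffix), "wb") as test:
            test.write(r.content)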
I have a question with using python and beautifulsoup.
My end result program basically fills out a form on a website and brings me back the results which I will eventually output to an lxml file. I'll be taking the results from https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS and I want to get a list for every city all into some excel documents.
Here is my code, I put it on pastebin:
http://pastebin.com/bZJfMp2N
MY RESULTS ARE ALMOST GOOD :D except that for my "correct value" I'm now getting the whole div element wrapping the number instead of just 355, for example. I want to parse that and only show the number; you will see what I mean when you run this in Python.
However, nothing I have tried works; there is no way I can parse that values_2 variable, because the results are a bs4.element.ResultSet when I think I need to parse a string. Sorry if I am a noob, I am still learning and have worked very long on this program.
Would anyone have any input? Anything would be appreciated! I've read that my results are in a list or something and I can't parse lists? How would I go about doing this?
Here is the code:
__author__ = 'kennytruong'
#THE PROBLEM HERE IS TO PARSE THE RESULTS PROPERLY!!
import urllib.parse, urllib.request
import re
from bs4 import BeautifulSoup

URL = "https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS"

#Goes through these locations, strips the whitespace in the string and creates a list that starts at every new line
LOCATIONS = '''
ALAMEDA ALAMEDA
'''.strip().split('\n')  #strip() basically removes whitespaces
print('Available locations to choose from:', LOCATIONS)

INSURANCE_TYPES = '''
HOMEOWNERS,CONDOMINIUM,MOBILEHOME,RENTERS,EARTHQUAKE - Single Family,EARTHQUAKE - Condominium,EARTHQUAKE - Mobilehome,EARTHQUAKE - Renters
'''.strip().split(',')  #strips the whitespaces and starts a newline of the list every comma
print('Available insurance types to choose from:', INSURANCE_TYPES)

COVERAGE_AMOUNTS = '''
15000,25000,35000,50000,75000,100000,150000,200000,250000,300000,400000,500000,750000
'''.strip().split(',')
print('All options for coverage amounts:', COVERAGE_AMOUNTS)

HOME_AGE = '''
New,1-3 Years,4-6 Years,7-15 Years,16-25 Years,26-40 Years,41-70 Years
'''.strip().split(',')
print('All Home Age Options:', HOME_AGE)

def get_premiums(location, coverage_type, coverage_amt, home_age):
    formEntries = {'location':location,
                   'coverageType':coverage_type,
                   'coverageAmount':coverage_amt,
                   'homeAge':home_age}
    inputData = urllib.parse.urlencode(formEntries)
    inputData = inputData.encode('utf-8')
    request = urllib.request.Request(URL, inputData)
    response = urllib.request.urlopen(request)
    responseData = response.read()
    soup = BeautifulSoup(responseData, "html.parser")
    parseResults = soup.find_all('tr', {'valign':'top'})
    for eachthing in parseResults:
        parse_me = eachthing.text
        name = re.findall(r'[A-z].+', parse_me)  #find me all the words that start with a cap, as many and it doesn't matter what kind.
                                                 # the . for any character and + to signify 1 or more of it.
        values = re.findall(r'\d{1,10}', parse_me)  #find me any digits, however many #'s long as long as btwn 1 and 10
        values_2 = eachthing.find_all('div', {'align':'right'})
        print('raw code for this part:\n', eachthing, '\n')
        print('here is the name: ', name[0], values)
        print('stuff on sheet 1- company name:', name[0], '- Premium Price:', values[0], '- Deductible', values[1])
        print('but here is the correct values - ', values_2)  #NEEDA STRIP THESE VALUES
        # print(type(values_2)) DOING SO GIVES ME <class 'bs4.element.ResultSet'>, NEEDA PARSE bs4.element type
        # values_3 = re.split(r'\d', values_2)
        # print(values_3) ANYTHING LIKE THIS WILL NOT WORK BECAUSE I BELIEVE RESULTS ARENT STRING
        print('\n\n')

def main():
    for location in LOCATIONS:  #seems to be looping the variable location in LOCATIONS - each location is one area
        print('Here are the options that you selected: ', location, "HOMEOWNERS", "150000", "New", '\n\n')
        get_premiums(location, "HOMEOWNERS", "150000", "New")  #calls function get_premiums and passes parameters

if __name__ == "__main__":  #this basically prevents all the indent level 0 code from getting executed, because otherwise the indent level 0 code gets executed regardless upon opening
    main()
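For what it's worth, each element of a bs4 ResultSet is a Tag, and Tag.get_text() returns its text as a plain string, so a sketch like this would reduce the div elements in values_2 to bare numbers:

values_2 = eachthing.find_all('div', {'align': 'right'})
numbers = [div.get_text(strip=True) for div in values_2]  # plain strings such as '355', ready to print or convert with int()
print('but here are the correct values -', numbers)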
This one has had me stumped for a couple of days now and I believe I've finally narrowed it down to this block of code. If anyone can tell me how to fix this, and why it is happening it would be awesome.
import urllib2

GetLink = 'http://somesite.com/search?q=datadata#page'
holder = range(1,3)
for LinkIncrement in holder:
    h = GetLink + str(LinkIncrement)
    ReadLink = urllib2.urlopen(h)
    f = open('test.txt', 'w')
    for line in ReadLink:
        f.write(line)
    f.close()
    main()  #calls function main that does stuff with the file
    continue
The problem is that it only ever writes the data from 'http://somesite.com/search?q=datadata#page'. If I do the below, the results print correctly:
for LinkIncrement in holder:
    h = GetLink + str(LinkIncrement)
    print h
The link I am copying does indeed increment in this manner and I am able to open the urls by copying and pasting. Additionally, I have tried this with a while loop, but always get the same results.
The below code opens 3 tabs with the incremented urls /search?q=datadata#page1, /search?q=datadata#page2, and /search?q=datadata#page3. Just can't make it work in my code.
import webbrowser
import urllib2

h = ''

def tab(passed):
    url = passed
    webbrowser.open_new_tab(url + '/')

def test():
    g = 'http://somesite.com/search?q=datadata#page'
    f = urllib2.urlopen(g)
    NewVar = 1
    PageCount = 1
    while PageCount < 4:
        h = g + str(NewVar)
        PageCount += 1
        NewVar += 1
        tab(h)

test()
Thanks to Falsetru for helping me figure this out. The website was using json for any pages after the first page.
In the URL, the part after # (the fragment identifier) is not passed to the web server; the server responds with the same content because the parts before the fragment identifier are identical.
The #something part is handled by the browser (JavaScript), so you need to look at what happens in the JavaScript.
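To illustrate with a small sketch: urlparse shows that the fragment is a separate, client-side component, so from the server's point of view all three URLs are the same request. The extra pages have to be fetched the way the page's JavaScript fetches them (per the edit above, a JSON source in this case).

from urlparse import urlparse

for n in range(1, 4):
    parts = urlparse('http://somesite.com/search?q=datadata#page%d' % n)
    # Only scheme, host, path and query are used to build the HTTP request:
    print parts.scheme + '://' + parts.netloc + parts.path + '?' + parts.query, '| fragment:', parts.fragment
# All three iterations print the same request URL; only the fragment differs.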