I have a log file which contains:
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
http://www.downloadray.com/windows/Photos_and_Images/Graphic_Capture/
http://www.downloadray.com/windows/Photos_and_Images/Digital_Photo_Tools/
I have this code:
from bs4 import BeautifulSoup
import urllib
import urlparse
f = open("downloadray2.txt")
g = open("downloadray3.txt", "w")
for line in f.readlines():
    i = 1
    while 1:
        url = line+"?page=%d" % i
        pageHtml = urllib.urlopen(url)
        soup = BeautifulSoup(pageHtml)
        has_more = 1
        for a in soup.select("div.n_head2 a[href]"):
            try:
                print (a["href"])
                g.write(a["href"]+"\n")
            except:
                print "no link"
        if has_more:
            i += 1
        else:
            break
This code does not give an error, but it does not work.
I tried modifying it but couldn't solve it.
But when I try this code, it works well:
from bs4 import BeautifulSoup
import urllib
import urlparse
g = open("downloadray3.txt", "w")
url = "http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/"
pageUrl = urllib.urlopen(url)
soup = BeautifulSoup(pageUrl)
i = 1
while 1:
    url1 = url+"?page=%d" % i
    pageHtml = urllib.urlopen(url1)
    soup = BeautifulSoup(pageHtml)
    has_more = 2
    for a in soup.select("div.n_head2 a[href]"):
        try:
            print (a["href"])
            g.write(a["href"]+"\n")
        except:
            print "no link"
    if has_more:
        i += 1
    else:
        break
So how can I make it read the URLs from the log text file? It is tedious to take the links one by one and run them manually.
Have you stripped the newline from the end of the line?
for line in f.readlines():
    line = line.strip()
readlines() will produce a list of lines taken from the file including the newline \n character.
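For example, here is a quick sketch (using a throwaway demo.txt rather than the real downloadray2.txt) showing the newline that readlines() keeps and that strip() removes:
with open("demo.txt", "w") as out:
    out.write("http://www.example.com/a/\nhttp://www.example.com/b/\n")

with open("demo.txt") as demo:
    lines = demo.readlines()

print(repr(lines[0]))          # 'http://www.example.com/a/\n' -- newline still attached
print(repr(lines[0].strip()))  # 'http://www.example.com/a/'   -- safe to append "?page=1" to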
Proof: evidence obtained by printing the url variable (after the line `url = line+"?page=%d" % i`):
Your original code:
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
?page=1
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
?page=2
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
?page=3
With my suggested fix:
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/?page=1
http://www.downloadray.com/TIFF-to-JPG_download/
http://www.downloadray.com/Moo0-Image-Thumbnailer_download/
http://www.downloadray.com/Moo0-Image-Sizer_download/
http://www.downloadray.com/Advanced-Image-Viewer-and-Converter_download/
http://www.downloadray.com/GandMIC_download/
http://www.downloadray.com/SendTo-Convert_download/
http://www.downloadray.com/PNG-To-JPG-Converter-Software_download/
http://www.downloadray.com/Graphics-Converter-Pro_download/
http://www.downloadray.com/PICtoC_download/
http://www.downloadray.com/Free-Images-Converter_download/
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/?page=2
http://www.downloadray.com/VarieDrop_download/
http://www.downloadray.com/Tinuous_download/
http://www.downloadray.com/Acme-CAD-Converter_download/
http://www.downloadray.com/AAOImageConverterandFTP_download/
http://www.downloadray.com/ImageCool-Converter_download/
http://www.downloadray.com/GeoJpeg_download/
http://www.downloadray.com/Android-Resizer-Tool_download/
http://www.downloadray.com/Scarab-Darkroom_download/
http://www.downloadray.com/Jpeg-Resizer_download/
http://www.downloadray.com/TIFF2PDF_download/
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/?page=3
http://www.downloadray.com/JGraphite_download/
http://www.downloadray.com/Easy-PNG-to-Icon-Converter_download/
http://www.downloadray.com/JBatch-It!_download/
http://www.downloadray.com/Batch-It!-Pro_download/
http://www.downloadray.com/Batch-It!-Ultra_download/
http://www.downloadray.com/Image-to-Ico-Converter_download/
http://www.downloadray.com/PSD-To-PNG-Converter-Software_download/
http://www.downloadray.com/VectorNow_download/
http://www.downloadray.com/KeitiklImages_download/
http://www.downloadray.com/STOIK-Smart-Resizer_download/
Update:
Then again, this code won't run as expected, because the while loop will never terminate, since the has_more variable is never changed.
You know that you don't have more links when the list returned by `soup.select(...)` is empty. You can check for emptiness using `len(...)`. So that part might go like this:
list_of_links = soup.select("div.n_head2 a[href]")
if len(list_of_links)==0:
    break
else:
    for a in soup.select("div.n_head2 a[href]"):
        print (a["href"])
        g.write(a["href"]+"\n")
    i += 1
Apparently the page still displays the last available page if it is queried beyond the maximum page number. So if the maximum page number available is 82 and you query page 83, it will give you page 82. To detect this case, you can save the list of URLs from the previous page and compare it with the current list of URLs.
Here is the full code (tested):
from bs4 import BeautifulSoup
import urllib
import urlparse

f = open("downloadray2.txt")
g = open("downloadray3.txt", "w")
for line in f.readlines():
    line = line.strip()
    i = 1
    prev_urls = []
    while 1:
        url = line+"?page=%d" % i
        print 'Examining %s' % url
        pageHtml = urllib.urlopen(url)
        soup = BeautifulSoup(pageHtml)
        list_of_urls = soup.select("div.n_head2 a[href]")
        if set(prev_urls)==set(list_of_urls):
            break
        else:
            for a in soup.select("div.n_head2 a[href]"):
                print (a["href"])
                g.write(a["href"]+"\n")
            i += 1
            prev_urls = list_of_urls
Related
When I run the code below, the for loop saves the first text correctly into a separate file, but the second iteration saves the first AND the second into another separate file, the third iteration saves the first, second and third into a separate file, and so on. I'd like to save each iteration into a separate file without including the previous iterations. I don't have a clue what I'm missing here. Can anyone help, please?
import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'http://www.chakoteya.net/StarTrek/'
end_url = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
           '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']
episodes = []
count = 0
for end_url in end_url:
    url = base_url + end_url
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    episodes.append(soup.text)
    file_text = open(f"./{count}.txt", "w")
    file_text.writelines()
    file_text.close()
    count = count + 1
    print(f"saved file for url:{url}")
Please consider the following points!
There's no reason at all to use bs4 here, since response.text already holds the same content.
You have to use the same Session, as explained in my previous answer.
You can use iteration with an f-string/format, which will make your code cleaner and easier to read.
A with context manager is less of a headache, as you don't need to remember to close your file afterwards.
import requests

block = [9, 13, 14, 15]

def main(url):
    with requests.Session() as req:
        for page in range(1, 17):
            if page not in block:
                print(f'Extracting Page# {page}')
                r = req.get(url.format(page))
                with open(f'{page}.htm', 'w') as f:
                    f.write(r.text)

main('http://www.chakoteya.net/StarTrek/{}.htm')
You need to empty your episodes list on each iteration. Try the following:
import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'http://www.chakoteya.net/StarTrek/'
end_url = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
           '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']
count = 0
for end_url in end_url:
    episodes = []
    url = base_url + end_url
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    episodes.append(soup.text)
    file_text = open(f"./{count}.txt", "w")
    file_text.writelines(episodes)
    file_text.close()
    count = count + 1
    print(f"saved file for url:{url}")
It doesn't appear that your code would save anything to the files at all, as you are calling writelines with no arguments.
if __name__ == '__main__':
    import requests
    from bs4 import BeautifulSoup

    base_url = 'http://www.chakoteya.net/StarTrek/'
    paths = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
             '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']

    for path in paths:
        url = f'{base_url}{path}'
        filename = path.split('.')[0]
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        with open(f"./{filename}.txt", "w") as f:
            f.write(soup.text)
        print(f"saved file for url:{url}")
This is reworked a little. It wasn't clear why the data was being appended to episodes, so that was left out.
Maybe you were writing the list to the file, which would account for the duplicates: you were adding the content for each file to a growing list and writing that whole list on each iteration.
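For illustration, a minimal sketch (with hypothetical placeholder data rather than the Star Trek pages) of why writing a growing list on every iteration duplicates earlier content, compared with writing only the current page:
pages = ["first page text", "second page text", "third page text"]

# Growing list: cumulative_2.txt ends up containing all three pages.
episodes = []
for count, text in enumerate(pages):
    episodes.append(text)
    with open(f"cumulative_{count}.txt", "w") as f:
        f.writelines(episodes)

# Current item only: each file contains just its own page.
for count, text in enumerate(pages):
    with open(f"single_{count}.txt", "w") as f:
        f.write(text)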
I am trying to write a program that opens a url, finds a name in a certain line, and saves it. Then it should find the url in the same line as the name, open it, and find the name + url in the same line # as the previous page. It should do this 4 times.
I can't get it to iterate through the new url parameter. It keeps returning the same name and url. What is going wrong here? Thanks.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import ssl

linklist = list()
namelist = list()
linelist = list()
count = 0
listposition = int(input("Please enter list position: "))
goodnamelist = list(["Fikret"])
nexturl = "http://py4e-data.dr-chuck.net/known_by_Fikret.html"

def listfunction(url):
    ctx = ssl.create_default_context()
    # Allows reading of HTTPS pages
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    html = urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    linelist = soup('a')
    for line in linelist:
        # Creates list of lines in webpage:
        linklist.append(re.findall("(http://.+)\"", str(line)))
        # Creates list of names in line:
        namelist.append(re.findall(">(.+)</a>", str(line)))
    # Creates list of names in the designated user-input position:
    goodnamelist.append(namelist[listposition][0])
    nexturl = linklist[listposition][0]
    return nexturl

while (count < 4):
    nexturl = listfunction(nexturl)
    print(listfunction(nexturl))
    count += 1
    print(nexturl)
    continue

print(linelist)
print(linklist)
print(namelist)
print(nexturl)
print(goodnamelist)
print(listfunction(nexturl))
You do not actually set nexturl in listfunction(). Therefore the method just returns the same initial global variable every time.
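As a minimal sketch of the underlying Python behaviour (hypothetical names, not the original scraper), an assignment inside a function creates a local variable, so the module-level name only changes if you rebind it with the returned value:
nexturl = "start"

def step():
    nexturl = "next"   # local to step(); the global nexturl is untouched
    return nexturl

step()
print(nexturl)         # still "start"

nexturl = step()       # rebind the global with the return value
print(nexturl)         # now "next"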
I have a file called IP2.txt and this file contains 2 rows as shown below.
103.201.150.209
113.170.129.113
My code goes like this: it reads the file IP2.txt and looks up each IP using the website's search.
import requests
from bs4 import BeautifulSoup
import re

fh = open('IP2.txt')
for line in fh:
    ip = line.rstrip()
    print(ip)
    loads = {'q':line.rstrip(),'engine':1}
    r = requests.get('https://fortiguard.com/search',params=loads)
    # print(r.url)
    # print(r.text)
Link_text = r.text
soup = BeautifulSoup(Link_text, 'lxml')
for product in soup.find_all('section', class_='iprep'):
    product_title = product.find("a").text
    print(ip+':'+ product_title)
fh.close()
The output of the above code is like this.
103.201.150.209
113.170.129.113
113.170.129.113:Malicious Websites
As you can see, it is only looking up the last line and skipping the first value: 103.201.150.209.
It seems like your indentation is not correct, causing lines that should be part of your loops to be executed only once after those loops are over. You are probably looking for this:
with open('IP2.txt') as fh:
    for line in fh:
        ip = line.rstrip()
        print(ip)
        loads = {'q':line.rstrip(), 'engine':1}
        r = requests.get('https://fortiguard.com/search', params=loads)
        # do the following for ALL ips
        soup = BeautifulSoup(r.text, 'lxml')
        for product in soup.find_all('section', class_='iprep'):
            product_title = product.find("a").text
            # print ALL products
            print(ip + ':' + product_title)
Also note the use of with which will auto-close your file even if something goes wrong in between.
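As a minimal sketch (using a throwaway demo.txt rather than the real IP2.txt), the file is closed even when an exception escapes the block:
with open("demo.txt", "w") as fh:
    fh.write("103.201.150.209\n")

try:
    with open("demo.txt") as fh:
        raise RuntimeError("simulated failure while processing")
except RuntimeError:
    pass

print(fh.closed)  # True -- the context manager closed the file, no explicit fh.close() needed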
You are overriding the value of r every time in your for loop. You can create a list outside of your loop and append to it on every iteration. The other way would be to do all your BeautifulSoup operations and printing inside your for loop; then you will get a printout for every r.
I think you need to loop over the return values from requests:
import requests
from bs4 import BeautifulSoup
import re

results = []
with open('IP2.txt') as fh:
    for line in fh:
        ip = line.rstrip()
        print(ip)
        loads = {'q': ip, 'engine': 1}
        r = requests.get('https://fortiguard.com/search', params=loads)
        # keep the ip together with its response so the report below stays matched
        results.append((ip, r.text))

for ip, text in results:
    soup = BeautifulSoup(text, 'lxml')
    for product in soup.find_all('section', class_='iprep'):
        product_title = product.find("a").text
        print(ip + ':' + product_title)
I think what you need is:
import requests
from bs4 import BeautifulSoup
import re

fh = open('IP2.txt')
for line in fh:
    ip = line.rstrip()
    print(ip)
    loads = {'q':line.rstrip(),'engine':1}
    r = requests.get('https://fortiguard.com/search',params=loads)
    Link_text = r.text
    soup = BeautifulSoup(Link_text, 'lxml')
    for product in soup.find_all('section', class_='iprep'):
        product_title = product.find("a").text
        print(ip+':'+ product_title)
fh.close()
Seems more like an indentation problem.
I am trying to copy all the data within an HTML page that has the class "chapter_header_styling", using BS4.
This was working when I manually input the URL, but it is tedious when there are multiple books and various chapters. So I then created another script that would generate all the chapter URLs for the book and combine them into a text file, bchap.txt (book chapters).
Since then I have altered the file and added various breakpoints, so ignore my lack of comments and the unused arrays/lists. I have narrowed it down to the ###Comment## where it doesn't work. It's probably not nested right, but I'm not sure... I had this working to a point, but I can't figure out why it won't write the mydivs data into the BOOK.html file. If anyone with more experience could point me in the right direction, much would be appreciated.
#mkbook.py
# coding: utf-8
from bs4 import BeautifulSoup
import requests

LINK = "https://codes.iccsafe.org/content/FAC2017"
pop = ""
#z = ""
chapters = open("bchap.txt",'r')
a = []
for aline in chapters:
    chap = aline
    #print (chap)
    #pop = ""
    pop = LINK+chap
    #print (pop)
    r = requests.get(pop)
    data = r.text
    #print(data)
    soup = BeautifulSoup(data, 'html.parser')
    mydivs = soup.findAll("div", {"class": ["annotator", "chapter_header_styling"]})
    f = open("BOOK.html","a")
    f.write("test <br/>")
    ########################################
    #MY PROBLEM IS BELOW NOT PRINTING DIV DATA INTO TXT FILE
    ########################################
    for div in mydivs:
        print (div)
        z = str(div)
        print(z) #doesn't printout...why???
        f.write(z)
    print len(mydivs)
    f.close()
chapters.close()

##############################################
## this is the old mkbook.py code before I looped it - inputing url 1 # time
#
# coding: utf-8
from bs4 import BeautifulSoup
import requests

r = requests.get("https://codes.iccsafe.org/content/FAC2017/preface")
data = r.text
soup = BeautifulSoup(data, 'html.parser')
a = []
mydivs = soup.findAll("div",{"class":["annotator",
                              "chapter_header_styling"]})
f = open("BOOK.html","a")
for div in mydivs:
    z = str(div)
    f.write(z)
f.close()
print len(mydivs) #outputs 1 if copied div data.

#######################################
#mkchap.py
# coding: utf-8
from bs4 import BeautifulSoup
import requests

r = requests.get("https://codes.iccsafe.org/content/FAC2017")
data = r.text
soup = BeautifulSoup(data, 'html.parser')
a = []
soup.findAll('option',{"value":True})
list = soup.findAll('option')
with open('bchap.txt', 'w') as filehandle:
    for l in list:
        filehandle.write(l['value'])
        filehandle.write("\n")
        print l['value']
#with open('bchap.txt', 'w') as filehandle:
#    filehandle.write("%s\n" % list)
filehandle.close()
The problem seems to be that you are constructing your URL using the wrong base URL.
LINK = "https://codes.iccsafe.org/content/FAC2017"
If you take a look at your 1st request you can see this clearly.
print(pop)
print(r.status_code)
Outputs:
https://codes.iccsafe.org/content/FAC2017/content/FAC2017
404
After running the code to populate bchap.txt, its output is
/content/FAC2017
/content/FAC2017/legend
/content/FAC2017/copyright
/content/FAC2017/preface
/content/FAC2017/chapter-1-application-and-administration
/content/FAC2017/chapter-2-scoping-requirements
/content/FAC2017/chapter-3-building-blocks
/content/FAC2017/chapter-4-accessible-routes
/content/FAC2017/chapter-5-general-site-and-building-elements
/content/FAC2017/chapter-6-plumbing-elements-and-facilities
/content/FAC2017/chapter-7-communication-elements-and-features
/content/FAC2017/chapter-8-special-rooms-spaces-and-elements
/content/FAC2017/chapter-9-built-in-elements
/content/FAC2017/chapter-10-recreation-facilities
/content/FAC2017/list-of-figures
/content/FAC2017/fair-housing-accessibility-guidelines-design-guidelines-for-accessible-adaptable-dwellings
/content/FAC2017/advisory
Let's change the base URL first and try again.
from bs4 import BeautifulSoup
import requests

LINK = "https://codes.iccsafe.org"
pop = ""
chapters = open("bchap.txt",'r')
a = []
for aline in chapters:
    chap = aline
    pop = LINK+chap
    r = requests.get(pop)
    print(pop)
    print(r.status_code)
chapters.close()
Outputs:
https://codes.iccsafe.org/content/FAC2017
404
...
Why? Because of the \n. If we do a
print(repr(pop))
It will output
'https://codes.iccsafe.org/content/FAC2017\n'
You'll have to strip away that \n as well. The final code that worked is:
from bs4 import BeautifulSoup
import requests

LINK = "https://codes.iccsafe.org"
pop = ""
chapters = open("bchap.txt",'r')
a = []
for aline in chapters:
    chap = aline
    pop = LINK+chap
    r = requests.get(pop.strip())
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')
    mydivs = soup.findAll("div", class_="annotator chapter_header_styling")
    f = open("BOOK.html","a")
    for div in mydivs:
        z = str(div)
        f.write(z)
    f.close()
chapters.close()
I want to execute a loop if and only if 5 lines have been written to the text file. The reason being, I want the average to be calculated from the final 5 lines of the text file, and if the program doesn't have 5 numbers to work with, a runtime error is thrown.
#Imports
from bs4 import BeautifulSoup
from urllib import urlopen
import time

#Required Fields
pageCount = 1290429

#Loop
logFile = open("PastWinners.txt", "r+")
logFile.truncate()
while(pageCount>0):
    time.sleep(1)
    html = urlopen('https://www.csgocrash.com/game/1/%s' % (pageCount)).read()
    soup = BeautifulSoup(html, "html.parser")
    try:
        section = soup.find('div', {"class":"row panel radius"})
        crashPoint = section.find("b", text="Crashed At: ").next_sibling.strip()
        logFile.write(crashPoint[0:-1]+"\n")
    except:
        continue
    for i, line in enumerate(logFile): #After 5 lines, execute this
        if i > 4:
            data = [float(line.rstrip()) for line in logFile]
            print("Average: " + "{0:0.2f}".format(sum(data[-5:])/len(data[-5:])))
        else:
            continue
    print(crashPoint[0:-1])
    pageCount+=1
logFile.close()
If anyone knows the solution, or knows a better way to go about doing this, it would be helpful, thanks :).
Edit:
Updated Code:
#Imports
from bs4 import BeautifulSoup
from urllib import urlopen
import time

#Required Fields
pageCount = 1290429
lineCount = 0

def FindAverage():
    with open('PastWinners.txt') as logFile:
        data = [float(line.rstrip()) for line in logFile]
        print("Average: " + "{0:0.2f}".format(sum(data[-5:])/len(data[-5:])))

#Loop
logFile = open("PastWinners.txt", "r+")
logFile.truncate()
while(pageCount>0):
    time.sleep(1)
    html = urlopen('https://www.csgocrash.com/game/1/%s' % (pageCount)).read()
    soup = BeautifulSoup(html, "html.parser")
    if lineCount > 4:
        logFile.close()
        FindAverage()
    else:
        continue
    try:
        section = soup.find('div', {"class":"row panel radius"})
        crashPoint = section.find("b", text="Crashed At: ").next_sibling.strip()
        logFile.write(crashPoint[0:-1]+"\n")
    except:
        continue
    print(crashPoint[0:-1])
    pageCount+=1
    lineCount+=1
logFile.close()
New Problem:
The program runs as expected; however, once the average is calculated and displayed, the program doesn't loop again, it just stops. I want it to work so that after 5 lines it calculates the average, then fetches and displays the next number, then displays a new average, and so on.
Your while loop is never going to end. I think you meant to decrement: pageCount-=1.
The problem at the end was that the loop wasn't restarting and just finished after the first average calculation. This was due to logFile being closed and never reopened, so nothing more could be appended to the file. After reopening the file in append mode once the average has been calculated, it works just as expected. Thanks to all for the help.
#Imports
from bs4 import BeautifulSoup
from urllib import urlopen
import time

#Required Fields
pageCount = 1290429
lineCount = 0

def FindAverage():
    with open('PastWinners.txt') as logFile:
        data = [float(line.rstrip()) for line in logFile]
        print("Average: " + "{0:0.2f}".format(sum(data[-5:])/len(data[-5:])))

#Loop
logFile = open("PastWinners.txt", "r+")
logFile.truncate()
while(pageCount>0):
    time.sleep(1)
    html = urlopen('https://www.csgocrash.com/game/1/%s' % (pageCount)).read()
    soup = BeautifulSoup(html, "html.parser")
    try:
        section = soup.find('div', {"class":"row panel radius"})
        crashPoint = section.find("b", text="Crashed At: ").next_sibling.strip()
        logFile.write(crashPoint[0:-1]+"\n")
    except:
        continue
    print(crashPoint[0:-1])
    pageCount+=1
    lineCount+=1
    if lineCount > 4:
        logFile.close()
        FindAverage()
        logFile = open("PastWinners.txt", "a+")
    else:
        continue
logFile.close()