Trigger loop after n lines in a text file - python

I want to execute a loop if and only if 5 lines have been written to the text file. The reason being, I want the average to be calculated from the final 5 lines of the text file, and if the program doesn't have 5 numbers to work with, a runtime error is thrown.
#Imports
from bs4 import BeautifulSoup
from urllib import urlopen
import time
#Required Fields
pageCount = 1290429
#Loop
logFile = open("PastWinners.txt", "r+")
logFile.truncate()
while(pageCount>0):
    time.sleep(1)
    html = urlopen('https://www.csgocrash.com/game/1/%s' % (pageCount)).read()
    soup = BeautifulSoup(html, "html.parser")
    try:
        section = soup.find('div', {"class":"row panel radius"})
        crashPoint = section.find("b", text="Crashed At: ").next_sibling.strip()
        logFile.write(crashPoint[0:-1]+"\n")
    except:
        continue
    for i, line in enumerate(logFile): #After 5 lines, execute this
        if i > 4:
            data = [float(line.rstrip()) for line in logFile]
            print("Average: " + "{0:0.2f}".format(sum(data[-5:])/len(data[-5:])))
        else:
            continue
    print(crashPoint[0:-1])
    pageCount+=1
logFile.close()
If anyone knows the solution, or knows a better way to go about doing this, it would be helpful, thanks :).
Edit:
Updated Code:
#Imports
from bs4 import BeautifulSoup
from urllib import urlopen
import time
#Required Fields
pageCount = 1290429
lineCount = 0
def FindAverage():
    with open('PastWinners.txt') as logFile:
        data = [float(line.rstrip()) for line in logFile]
        print("Average: " + "{0:0.2f}".format(sum(data[-5:])/len(data[-5:])))
#Loop
logFile = open("PastWinners.txt", "r+")
logFile.truncate()
while(pageCount>0):
    time.sleep(1)
    html = urlopen('https://www.csgocrash.com/game/1/%s' % (pageCount)).read()
    soup = BeautifulSoup(html, "html.parser")
    if lineCount > 4:
        logFile.close()
        FindAverage()
    else:
        continue
    try:
        section = soup.find('div', {"class":"row panel radius"})
        crashPoint = section.find("b", text="Crashed At: ").next_sibling.strip()
        logFile.write(crashPoint[0:-1]+"\n")
    except:
        continue
    print(crashPoint[0:-1])
    pageCount+=1
    lineCount+=1
logFile.close()
New Problem:
The program runs as expected; however, once the average is calculated and displayed, the program doesn't loop again, it just stops. I want it to work so that after 5 lines it calculates the average, then displays the next number, then a new average, and so on.

Your while loop is never going to end. I think you meant to decrement: pageCount-=1.

The problem at the end was that the loop wasn't restarting and just finished after the first average calculation. This was due to logFile being closed and never reopened. By reopening the file in append mode after calculating the average, it works just as expected. Thanks to all for the help.
#Imports
from bs4 import BeautifulSoup
from urllib import urlopen
import time
#Required Fields
pageCount = 1290429
lineCount = 0
def FindAverage():
    with open('PastWinners.txt') as logFile:
        data = [float(line.rstrip()) for line in logFile]
        print("Average: " + "{0:0.2f}".format(sum(data[-5:])/len(data[-5:])))
#Loop
logFile = open("PastWinners.txt", "r+")
logFile.truncate()
while(pageCount>0):
    time.sleep(1)
    html = urlopen('https://www.csgocrash.com/game/1/%s' % (pageCount)).read()
    soup = BeautifulSoup(html, "html.parser")
    try:
        section = soup.find('div', {"class":"row panel radius"})
        crashPoint = section.find("b", text="Crashed At: ").next_sibling.strip()
        logFile.write(crashPoint[0:-1]+"\n")
    except:
        continue
    print(crashPoint[0:-1])
    pageCount+=1
    lineCount+=1
    if lineCount > 4:
        logFile.close()
        FindAverage()
        logFile = open("PastWinners.txt", "a+")
    else:
        continue
logFile.close()
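As an aside, here is a minimal sketch of an alternative approach (not the poster's code; the sample values are stand-ins) that keeps the last five values in a collections.deque, so the rolling average can be printed after every new line without closing and reopening the log file:
from collections import deque

last_five = deque(maxlen=5)  # automatically discards the oldest value

with open("PastWinners.txt", "w") as logFile:
    for crashPoint in ["1.23", "2.50", "1.01", "4.76", "1.95", "3.10"]:  # stand-in values
        logFile.write(crashPoint + "\n")
        last_five.append(float(crashPoint))
        if len(last_five) == 5:  # only average once five numbers exist
            print("Average: " + "{0:0.2f}".format(sum(last_five) / len(last_five)))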

Related

Saving a "for loop" iteration

When I run the code below, the for loop saves the first text correctly into a separate file, but the second iteration saves the first AND the second into another separate file, the third iteration saves the first, second and third into yet another file, and so on. I'd like to save each iteration into a separate file without adding the previous iterations. I don't have a clue what I'm missing here. Can anyone help, please?
import requests
from bs4 import BeautifulSoup
import pandas as pd
base_url = 'http://www.chakoteya.net/StarTrek/'
end_url = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
           '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']
episodes = []
count = 0
for end_url in end_url:
    url = base_url + end_url
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    episodes.append(soup.text)
    file_text = open(f"./{count}.txt", "w")
    file_text.writelines()
    file_text.close()
    count = count + 1
    print(f"saved file for url:{url}")
Please consider the following points:
There's no reason at all to use bs4 here, since response.text already holds the same content.
You have to use the same Session, as explained in my previous answer.
You can use iteration with an f-string/format, which will make your code cleaner and easier to read.
A with context manager is less of a headache, as you don't need to remember to close your file afterwards.
import requests

block = [9, 13, 14, 15]

def main(url):
    with requests.Session() as req:
        for page in range(1, 17):
            if page not in block:
                print(f'Extracting Page# {page}')
                r = req.get(url.format(page))
                with open(f'{page}.htm', 'w') as f:
                    f.write(r.text)

main('http://www.chakoteya.net/StarTrek/{}.htm')
You needed to empty your episodes for each iteration. Try the following:
import requests
from bs4 import BeautifulSoup
import pandas as pd
base_url = 'http://www.chakoteya.net/StarTrek/'
end_url = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
           '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']
count = 0
for end_url in end_url:
    episodes = []
    url = base_url + end_url
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    episodes.append(soup.text)
    file_text = open(f"./{count}.txt", "w")
    file_text.writelines(episodes)
    file_text.close()
    count = count + 1
    print(f"saved file for url:{url}")
It doesn't appear that your code would save anything to the files at all, as you are calling writelines with no arguments.
if __name__ == '__main__':
    import requests
    from bs4 import BeautifulSoup

    base_url = 'http://www.chakoteya.net/StarTrek/'
    paths = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
             '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']

    for path in paths:
        url = f'{base_url}{path}'
        filename = path.split('.')[0]
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        with open(f"./{filename}.txt", "w") as f:
            f.write(soup.text)
        print(f"saved file for url:{url}")
This is reworked a little. It wasn't clear why the data was being appended to episodes, so that was left off.
Maybe you were writing the list to the file, which would account for the duplicates: you were adding the content for each file to a list and writing that growing list on every iteration.

I would like to find out if newly found links from Beautiful Soup are already in the queue.txt file and crawled.txt file

I have a Beautiful Soup program where I find all the links on a webpage and put them in a queue.txt file. The program then gets each link from the file and finds all the links on those pages. The crawled links then get put into a crawled.txt file.
I want to make sure I get no duplicates, so I want the program to go through queue.txt and crawled.txt, and if the links that have just been found are already in those files, then the newly found links shouldn't be written.
I have tried putting the newly found links into a list, removing duplicates from there, and writing the list to a .txt file, but that can't tell what is already in the queue file; it only removes duplicates among the links found on the one page.
This is the code:
from bs4 import BeautifulSoup
import requests
import re
from urllib.parse import urlparse
def get_links(base_url, file_name):
    page = requests.get(base_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    single_slash = re.compile(r'^/\w')
    double_slash = re.compile(r'^//\w')
    parsed_uri = urlparse(base_url)
    domain_name = '{uri.scheme}://{uri.netloc}'.format(uri=parsed_uri)
    with open(file_name, "a") as f:
        for tag in soup.find_all('a'):
            link = str(tag.get('href'))
            if str(link).startswith("http"):
                link = link
                print(link)
            if double_slash.match(link):
                link = 'https:' + link
                print(link)
            if single_slash.match(link):
                link = domain_name + link
                print(link)
            if str(link).startswith("#"):
                continue
            if str(link).startswith("j"):
                continue
            if str(link).startswith('q'):
                continue
            if str(link).startswith('u'):
                continue
            if str(link).startswith('N'):
                continue
            if str(link).startswith('m'):
                continue
            try:
                f.write(link + '\n')
            except:
                pass

get_links('https://stackabuse.com/reading-and-writing-lists-to-a-file-in-python/', "queue.txt")

with open('queue.txt') as f:
    lines = f.read().splitlines()
print(lines)

for link in lines:
    if lines[0] == "/":
        del lines[0]
    print(lines[0])
    with open('crawled.txt', 'a') as h:
        h.write('%s\n' % lines[0])
        h.close()
    del lines[0]
    if lines[0] == "/":
        del lines[0]
    with open('queue.txt', 'w') as filehandle:
        for listitem in lines:
            filehandle.write('%s\n' % listitem)
    page_url = lines[0]
    get_links(page_url, "queue.txt")
    print(lines)
    with open('queue.txt') as f:
        lines = f.read().splitlines()
In general for Python, when trying to remove duplicates, sets are usually a good bet. For example:
lines = open('queue.txt', 'r').readlines()
queue_set = set(lines)
result = open('queue.txt', 'w')
for line in queue_set:
    result.write(line)
Note: This will not preserve the order of the links, but I don't see a reason for that in this case.
Also, this was adapted from this answer.
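To also check against both files as the question asks, here is a minimal sketch (the helper name append_new_links and the FileNotFoundError handling are assumptions, not part of the answer above) that loads queue.txt and crawled.txt into a set and only appends links that are not already present, keeping the new links in order:
def append_new_links(links, queue_file='queue.txt', crawled_file='crawled.txt'):
    # Collect every link already stored in either file.
    seen = set()
    for file_name in (queue_file, crawled_file):
        try:
            with open(file_name) as f:
                seen.update(line.strip() for line in f)
        except FileNotFoundError:
            pass  # treat a missing file as empty
    # Append only links that are genuinely new.
    with open(queue_file, 'a') as f:
        for link in links:
            if link not in seen:
                f.write(link + '\n')
                seen.add(link)  # also avoids duplicates within this batch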

My Python code only looks up the last value from the loop

I have a file called IP2.txt and this file contains 2 rows as shown below.
103.201.150.209
113.170.129.113
My code goes like this: it reads the file IP2.txt and looks up each IP on the website's search.
import requests
from bs4 import BeautifulSoup
import re
fh = open('IP2.txt')
for line in fh:
    ip = line.rstrip()
    print(ip)
    loads = {'q':line.rstrip(),'engine':1}
    r = requests.get('https://fortiguard.com/search',params=loads)
    # print(r.url)
    # print(r.text)
Link_text = r.text
soup = BeautifulSoup(Link_text, 'lxml')
for product in soup.find_all('section', class_='iprep'):
    product_title = product.find("a").text
    print(ip+':'+ product_title)
fh.close()
The output of the above code is like this.
103.201.150.209
113.170.129.113
113.170.129.113:Malicious Websites
As you can see it's reading the last line and skipping the first value: 103.201.150.209
It seems like your indentation is not correct, causing lines that should be part of your loops to be executed only once after those loops are over. You are probably looking for this:
with open('IP2.txt') as fh:
    for line in fh:
        ip = line.rstrip()
        print(ip)
        loads = {'q':line.rstrip(), 'engine':1}
        r = requests.get('https://fortiguard.com/search', params=loads)

        # do the following for ALL ips
        soup = BeautifulSoup(r.text, 'lxml')
        for product in soup.find_all('section', class_='iprep'):
            product_title = product.find("a").text
            # print ALL products
            print(ip + ':' + product_title)
Also note the use of with which will auto-close your file even if something goes wrong in between.
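For what it's worth, a with block on a file object behaves roughly like the following try/finally sketch, which is why the handle is closed even when an exception is raised mid-loop:
fh = open('IP2.txt')
try:
    for line in fh:
        ...  # same per-line work as above
finally:
    fh.close()  # runs even if the loop body raises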
You are overriding the value of r every time in your for loop. You can create a list outside of your loop and append to it on every iteration. The other way would be to do all your BeautifulSoup operations and printing inside your for loop; then you will get your printout for every r.
I think you need to loop over the return values from requests:
import requests
from bs4 import BeautifulSoup
import re

with open('IP2.txt') as fh:
    results = []
    for line in fh:
        ip = line.rstrip()
        print(ip)
        loads = {'q':line.rstrip(),'engine':1}
        r = requests.get('https://fortiguard.com/search',params=loads)
        results.append((ip, r.text))  # keep each ip paired with its response text
    for ip, text in results:
        soup = BeautifulSoup(text, 'lxml')
        for product in soup.find_all('section', class_='iprep'):
            product_title = product.find("a").text
            print(ip+':'+ product_title)
I think what you need is:
import requests
from bs4 import BeautifulSoup
import re
fh = open('IP2.txt')
for line in fh:
    ip = line.rstrip()
    print(ip)
    loads = {'q':line.rstrip(),'engine':1}
    r = requests.get('https://fortiguard.com/search',params=loads)
    Link_text = r.text
    soup = BeautifulSoup(Link_text, 'lxml')
    for product in soup.find_all('section', class_='iprep'):
        product_title = product.find("a").text
        print(ip+':'+ product_title)
fh.close()
Seems more like an indentation problem.

How can I save the results of my web scraping to a text file in beautiful soup?

I tried to write this code for some web scraping. The code works fine, but I still have trouble figuring out how I can save the results of my web scraping into a .txt file. I want to write the output of print(div.text) into a .txt file.
import bs4 as bs
import urllib.request
for pg in range(1, 100 + 1):
    source = urllib.request.urlopen('https://dsalsrv04.uchicago.edu/cgi-bin/app/hayyim_query.py?page='+ str(pg)).read()
    soup = bs.BeautifulSoup(source,'lxml')
    for div in soup.find_all('div', class_='hw_result'):
        print(div.text)
Maybe with open, f.write and f.close:
import bs4 as bs
import urllib.request
import re
output = ''
for pg in range(1, 100 + 1):
    source = urllib.request.urlopen('https://dsalsrv04.uchicago.edu/cgi-bin/app/hayyim_query.py?page='+ str(pg)).read()
    soup = bs.BeautifulSoup(source,'lxml')
    for div in soup.find_all('div', class_='hw_result'):
        output += div.text
output = re.sub(r"[\r\n]+", "", output)
f = open('/any/directory_you_like/any_name_that_you_like_with_any_extension.txt', 'w')
try:
    f.write(output)
finally:
    f.close()
Open a file before the loop
file = open("testfile.txt", "w")
The first argument is the file name and the second means that you want to write to this file.
And then instead of print(div.text) you should use file.write(div.text)
Close the file after the loop with file.close()
After all that, your code should look like this:
import bs4 as bs
import urllib.request
file = open("testfile.txt", "w")
for pg in range(1, 100 + 1):
    source = urllib.request.urlopen('https://dsalsrv04.uchicago.edu/cgi-bin/app/hayyim_query.py?page='+ str(pg)).read()
    soup = bs.BeautifulSoup(source,'lxml')
    for div in soup.find_all('div', class_='hw_result'):
        file.write(div.text)
file.close()
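As a small variant (a sketch; the encoding='utf-8' argument is an assumption about the page content, not something stated above), the same loop can use a with block so the file is closed even if a request fails partway through:
import bs4 as bs
import urllib.request

with open("testfile.txt", "w", encoding="utf-8") as file:
    for pg in range(1, 100 + 1):
        source = urllib.request.urlopen('https://dsalsrv04.uchicago.edu/cgi-bin/app/hayyim_query.py?page=' + str(pg)).read()
        soup = bs.BeautifulSoup(source, 'lxml')
        for div in soup.find_all('div', class_='hw_result'):
            file.write(div.text)  # file.close() is no longer needed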

How to get all application links from the log text file?

I have a log file which contains:
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
http://www.downloadray.com/windows/Photos_and_Images/Graphic_Capture/
http://www.downloadray.com/windows/Photos_and_Images/Digital_Photo_Tools/
I have this code:
from bs4 import BeautifulSoup
import urllib
import urlparse
f = open("downloadray2.txt")
g = open("downloadray3.txt", "w")
for line in f.readlines():
    i = 1
    while 1:
        url = line+"?page=%d" % i
        pageHtml = urllib.urlopen(url)
        soup = BeautifulSoup(pageHtml)
        has_more = 1
        for a in soup.select("div.n_head2 a[href]"):
            try:
                print (a["href"])
                g.write(a["href"]+"\n")
            except:
                print "no link"
        if has_more:
            i += 1
        else:
            break
This code does not give an error, but it does not work.
I tried modifying it but couldn't solve it.
But when I try this code, it works well:
from bs4 import BeautifulSoup
import urllib
import urlparse
g = open("downloadray3.txt", "w")
url = "http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/"
pageUrl = urllib.urlopen(url)
soup = BeautifulSoup(pageUrl)
i = 1
while 1:
    url1 = url+"?page=%d" % i
    pageHtml = urllib.urlopen(url1)
    soup = BeautifulSoup(pageHtml)
    has_more = 2
    for a in soup.select("div.n_head2 a[href]"):
        try:
            print (a["href"])
            g.write(a["href"]+"\n")
        except:
            print "no link"
    if has_more:
        i += 1
    else:
        break
So how can I make it read from the log text file? It is hard to take the links one by one to be read.
Have you stripped the newline from the end of the line?
for line in f.readlines():
    line = line.strip()
readlines() will produce a list of lines taken from the file including the newline \n character.
Evidence (obtained by printing the url variable after the line url = line+"?page=%d" % i):
Your original code:
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
?page=1
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
?page=2
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
?page=3
With my suggested fix:
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/?page=1
http://www.downloadray.com/TIFF-to-JPG_download/
http://www.downloadray.com/Moo0-Image-Thumbnailer_download/
http://www.downloadray.com/Moo0-Image-Sizer_download/
http://www.downloadray.com/Advanced-Image-Viewer-and-Converter_download/
http://www.downloadray.com/GandMIC_download/
http://www.downloadray.com/SendTo-Convert_download/
http://www.downloadray.com/PNG-To-JPG-Converter-Software_download/
http://www.downloadray.com/Graphics-Converter-Pro_download/
http://www.downloadray.com/PICtoC_download/
http://www.downloadray.com/Free-Images-Converter_download/
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/?page=2
http://www.downloadray.com/VarieDrop_download/
http://www.downloadray.com/Tinuous_download/
http://www.downloadray.com/Acme-CAD-Converter_download/
http://www.downloadray.com/AAOImageConverterandFTP_download/
http://www.downloadray.com/ImageCool-Converter_download/
http://www.downloadray.com/GeoJpeg_download/
http://www.downloadray.com/Android-Resizer-Tool_download/
http://www.downloadray.com/Scarab-Darkroom_download/
http://www.downloadray.com/Jpeg-Resizer_download/
http://www.downloadray.com/TIFF2PDF_download/
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/?page=3
http://www.downloadray.com/JGraphite_download/
http://www.downloadray.com/Easy-PNG-to-Icon-Converter_download/
http://www.downloadray.com/JBatch-It!_download/
http://www.downloadray.com/Batch-It!-Pro_download/
http://www.downloadray.com/Batch-It!-Ultra_download/
http://www.downloadray.com/Image-to-Ico-Converter_download/
http://www.downloadray.com/PSD-To-PNG-Converter-Software_download/
http://www.downloadray.com/VectorNow_download/
http://www.downloadray.com/KeitiklImages_download/
http://www.downloadray.com/STOIK-Smart-Resizer_download/
Update:
Then again, this code won't run as expected, because the while loop will never end, since the has_more variable is never changed.
You know that you don't have more links when the list returned by `soup.select(...)` is empty. You can check for emptiness using `len(...)`. So that part might go like this:
list_of_links = soup.select("div.n_head2 a[href]")
if len(list_of_links)==0:
    break
else:
    for a in soup.select("div.n_head2 a[href]"):
        print (a["href"])
        g.write(a["href"]+"\n")
    i += 1
Apparently the page still displays the latest available page if it's queried beyond the maximum page. So if the maximum page number available is 82 and you query page 83, it will give you page 82. To detect this case, you can save the list of the previous page's urls and compare it with the current list of urls.
Here is the full code (tested):
from bs4 import BeautifulSoup
import urllib
import urlparse
f = open("downloadray2.txt")
g = open("downloadray3.txt", "w")
for line in f.readlines():
    line = line.strip()
    i = 1
    prev_urls = []
    while 1:
        url = line+"?page=%d" % i
        print 'Examining %s' % url
        pageHtml = urllib.urlopen(url)
        soup = BeautifulSoup(pageHtml)
        list_of_urls = soup.select("div.n_head2 a[href]")
        if set(prev_urls)==set(list_of_urls):
            break
        else:
            for a in soup.select("div.n_head2 a[href]"):
                print (a["href"])
                g.write(a["href"]+"\n")
            i += 1
            prev_urls = list_of_urls
