When I run the code below, the for loop saves the first text correctly into its own file, but the second iteration saves the first AND the second text into another file, the third iteration saves the first, second and third, and so on. I'd like each iteration saved to a separate file without the text from the previous iterations. I can't see what I'm missing here. Can anyone help, please?
import requests
from bs4 import BeautifulSoup
import pandas as pd
base_url = 'http://www.chakoteya.net/StarTrek/'
end_url = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
           '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']
episodes = []
count = 0
for end_url in end_url:
    url = base_url + end_url
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    episodes.append(soup.text)
    file_text = open(f"./{count}.txt", "w")
    file_text.writelines()
    file_text.close()
    count = count + 1
    print(f"saved file for url:{url}")
Please consider the following points:
There's no need for bs4 if you only want to save the page, since response.text already holds the page source.
Use a single Session for all the requests, as explained in my previous answer.
You can iterate over the page numbers and build each URL with an f-string or format, which makes the code cleaner and easier to read.
A with context manager is less of a headache, as you don't need to remember to close the file afterwards.
import requests
block = [9, 13, 14, 15]
def main(url):
    with requests.Session() as req:
        for page in range(1, 17):
            if page not in block:
                print(f'Extracting Page# {page}')
                r = req.get(url.format(page))
                with open(f'{page}.htm', 'w') as f:
                    f.write(r.text)
main('http://www.chakoteya.net/StarTrek/{}.htm')
You need to reset episodes on each iteration, otherwise the list keeps the text from the earlier pages. Try the following:
import requests
from bs4 import BeautifulSoup
import pandas as pd
base_url = 'http://www.chakoteya.net/StarTrek/'
end_url = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
           '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']
count = 0
for end_url in end_url:
    episodes = []
    url = base_url + end_url
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    episodes.append(soup.text)
    file_text = open(f"./{count}.txt", "w")
    file_text.writelines(episodes)
    file_text.close()
    count = count + 1
    print(f"saved file for url:{url}")
It doesn't appear that your code would save anything to the files at all, as you are calling writelines with no arguments (which raises a TypeError).
if __name__ == '__main__':
    import requests
    from bs4 import BeautifulSoup
    base_url = 'http://www.chakoteya.net/StarTrek/'
    paths = ['1.htm', '6.htm', '8.htm', '2.htm', '7.htm',
             '5.htm', '4.htm', '10.htm', '12.htm', '11.htm', '3.htm', '16.htm']
    for path in paths:
        url = f'{base_url}{path}'
        filename = path.split('.')[0]
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        with open(f"./{filename}.txt", "w") as f:
            f.write(soup.text)
        print(f"saved file for url:{url}")
This is reworked a little. It wasn't clear why the data was being appended to episodes, so that was left out.
Maybe you were writing the list to the file, which would account for the dupes: you were adding each page's content to a list and writing that growing list on every iteration.
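The point about the growing list can be checked without any network access. Below is a minimal sketch, assuming placeholder strings in place of the scraped episode texts and a hypothetical helper named save_pages (neither is from the original post); each file receives only the current item.

```python
import tempfile
from pathlib import Path


def save_pages(pages, out_dir):
    """Write each page to its own numbered file and return the paths."""
    out_dir = Path(out_dir)
    paths = []
    for count, text in enumerate(pages):
        path = out_dir / f"{count}.txt"
        # Write only the current text, never an accumulating list.
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


# Placeholder strings stand in for the scraped episode texts (assumption).
demo_dir = tempfile.mkdtemp()
demo_paths = save_pages(["first", "second", "third"], demo_dir)
print([p.read_text(encoding="utf-8") for p in demo_paths])
```

Using enumerate also removes the need for a hand-maintained count variable.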
Related
So I was trying to make a filter that filters out the crap from this scrape, but it only removes the matching words. I would like to filter out the whole line instead of just the words.
from bs4 import BeautifulSoup
import requests
import os
def Scrape():
    page = input("Page: ")
    url = "https://openuserjs.org/?p=" + page
    source = requests.get(url)
    soup = BeautifulSoup(source.text,'lxml')
    os.system('cls')
    Filter(soup)
def Filter(soup):
    crap = ""
    f = open("Data/Crap.txt", "r")
    for craptext in f:
        crap = craptext
    for Titles in soup.select("a.tr-link-a>b"):
        print(Titles.text.replace(crap, "").strip())
while True:
    Scrape()
Instead of:
print(Titles.text.replace(crap, "").strip())
Try using:
if crap not in Titles.text:
    print(Titles.text.strip())
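The same idea extends to a whole list of blocked words. This is only a sketch under assumptions: load_blocked and filter_titles are hypothetical helper names, and the sample strings are made up. Note that lines read from a file keep their trailing newline, so they are stripped before being compared.

```python
def load_blocked(lines):
    """Strip whitespace (including trailing newlines) and drop empty entries."""
    return [line.strip() for line in lines if line.strip()]


def filter_titles(titles, blocked):
    """Keep only the titles that contain none of the blocked strings."""
    return [t.strip() for t in titles if not any(b in t for b in blocked)]


# Made-up sample data; in the real script these would come from
# Data/Crap.txt and from the soup.select(...) results.
blocked = load_blocked(["spam\n", "\n", "ads\n"])
titles = ["Useful script", "Free spam tool ", "No ads here", "Dark theme"]
print(filter_titles(titles, blocked))
```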
I am trying to find all the photo links in a website, and I use BeautifulSoup for it.
Here is my code:
import requests
from bs4 import BeautifulSoup as bs
url = "http://cupp.cyberport.hk/zh_TW/front_programmes/index"
webpage = requests.get(url)
soup = bs(webpage.content, "html.parser")
images = []
for img in soup.findAll('img'):
    images.append(img.get('src'))
with open("photo_links.txt", "w") as text_file:
    text_file.write(str(images))
And the results are
['https://www.cyberport.hk/images/logo.jpg','https://www.cyberport.hk/img/weather_icon/black/54.png','https://www.cyberport.hk/images/facebook.jpg', 'https://www.cyberport.hk/images/twitter.jpg','https://www.cyberport.hk/images/linkin.jpg', 'http://cupp.cyberport.hk/files/general_content/upload/12/hkcityu_logo.jpg','http://cupp.cyberport.hk/files/general_content/upload/13/hkbu_logo.jpg']
All items in the list were written on a single line in the txt file.
I want each item to be separated by "\n",
like this:
['https://www.cyberport.hk/images/logo.jpg',
'https://www.cyberport.hk/img/weather_icon/black/54.png',
'https://www.cyberport.hk/images/facebook.jpg',
'https://www.cyberport.hk/images/twitter.jpg',
'https://www.cyberport.hk/images/linkin.jpg',
'http://cupp.cyberport.hk/files/general_content/upload/12/hkcityu_logo.jpg',
'http://cupp.cyberport.hk/files/general_content/upload/13/hkbu_logo.jpg']
How can I modify the code so that I can get my preferred results?
Thank you.
You can do this:
import requests
from bs4 import BeautifulSoup as bs
url = "http://cupp.cyberport.hk/zh_TW/front_programmes/index"
webpage = requests.get(url)
soup = bs(webpage.content, "html.parser")
images = []
for img in soup.findAll('img'):
    images.append(img.get('src'))
url_list = '",\n"'.join(images)
with open("../test_files/photo_links.txt", "w") as text_file:
    text_file.write(f'"{url_list}",')
'",\n"'.join(images) creates a single string of the items in images, with a closing quote, a comma, a newline and an opening quote between each pair of items.
Could you try the solution below?
Change this line:
text_file.write(str(images))
to this:
text_file.write(str(images)+'\n')
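You can achieve it using string formatting. Just inject the list elements, joined with ',\n', between two square brackets. Before Python 3.12 an f-string expression cannot contain a backslash, so build the joined string first:
joined = ',\n'.join(images)
text_file.write(f"[{joined}]")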
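To get output that looks like the bracketed, quoted list in the question, one option is to join the repr of each item. This is a minimal sketch rather than a definitive solution: format_links and write_links are hypothetical helper names, and the example.com URLs are placeholders.

```python
import os
import tempfile


def format_links(images):
    """Return the list as text: bracketed, one quoted item per line."""
    # Join outside the f-string: a backslash inside an f-string
    # expression is a syntax error before Python 3.12.
    body = ",\n".join(repr(link) for link in images)
    return f"[{body}]"


def write_links(images, path):
    """Write the formatted list to path."""
    with open(path, "w", encoding="utf-8") as text_file:
        text_file.write(format_links(images))


# Placeholder URLs, not the ones scraped in the question.
links = ["https://example.com/a.jpg", "https://example.com/b.png"]
out_path = os.path.join(tempfile.mkdtemp(), "photo_links.txt")
write_links(links, out_path)
print(format_links(links))
```

If the brackets and quotes are not needed, writing '\n'.join(images) alone puts one bare URL on each line.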
I have a file called IP2.txt and this file contains 2 rows as shown below.
103.201.150.209
113.170.129.113
My code reads the file IP2.txt and looks up each IP on the website's search page:
import requests
from bs4 import BeautifulSoup
import re
fh = open('IP2.txt')
for line in fh:
    ip = line.rstrip()
    print(ip)
    loads = {'q':line.rstrip(),'engine':1}
    r = requests.get('https://fortiguard.com/search',params=loads)
    # print(r.url)
    # print(r.text)
Link_text = r.text
soup = BeautifulSoup(Link_text, 'lxml')
for product in soup.find_all('section', class_='iprep'):
    product_title = product.find("a").text
    print(ip+':'+ product_title)
fh.close()
The output of the above code is like this.
103.201.150.209
113.170.129.113
113.170.129.113:Malicious Websites
As you can see, it only reports a result for the last line and skips the first value: 103.201.150.209
It seems like your indentation is not correct, causing lines that should be part of your loops to be executed only once after those loops are over. You are probably looking for this:
with open('IP2.txt') as fh:
    for line in fh:
        ip = line.rstrip()
        print(ip)
        loads = {'q':line.rstrip(), 'engine':1}
        r = requests.get('https://fortiguard.com/search', params=loads)
        # do the following for ALL ips
        soup = BeautifulSoup(r.text, 'lxml')
        for product in soup.find_all('section', class_='iprep'):
            product_title = product.find("a").text
            # print ALL products
            print(ip + ':' + product_title)
Also note the use of with which will auto-close your file even if something goes wrong in between.
You are overwriting the value of r on every pass through your for loop. You can create a list outside the loop and append to it on each iteration. The other way is to do all your BeautifulSoup operations and printing inside the for loop, so that you get a printout for every r.
I think you need to loop over the return values from requests:
import requests
from bs4 import BeautifulSoup
import re
with open('IP2.txt') as fh:
    texts = []
    for line in fh:
        ip = line.rstrip()
        print(ip)
        loads = {'q':ip,'engine':1}
        r = requests.get('https://fortiguard.com/search',params=loads)
        # store each IP with its own response so the labels match below
        texts.append((ip, r.text))
for ip, text in texts:
    soup = BeautifulSoup(text, 'lxml')
    for product in soup.find_all('section', class_='iprep'):
        product_title = product.find("a").text
        print(ip+':'+ product_title)
I think what you need is:
import requests
from bs4 import BeautifulSoup
import re
fh = open('IP2.txt')
for line in fh:
    ip = line.rstrip()
    print(ip)
    loads = {'q':line.rstrip(),'engine':1}
    r = requests.get('https://fortiguard.com/search',params=loads)
    Link_text = r.text
    soup = BeautifulSoup(Link_text, 'lxml')
    for product in soup.find_all('section', class_='iprep'):
        product_title = product.find("a").text
        print(ip+':'+ product_title)
fh.close()
Seems more like an indentation problem.
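The effect of the indentation can be seen in isolation with a stand-in for the HTTP request. This is a rough sketch, assuming a hypothetical fake_fetch function and documentation-range IP addresses in place of the real lookup; the only point is that work indented under the loop runs once per IP.

```python
def lookup_all(ips, fetch):
    """Call fetch once per IP and pair each result with its own IP."""
    results = []
    for ip in ips:
        category = fetch(ip)
        # Indented under the loop, so this runs for every IP,
        # not just for the last one.
        results.append(f"{ip}:{category}")
    return results


def fake_fetch(ip):
    """Hypothetical stand-in for the request plus HTML parsing."""
    return f"category-for-{ip}"


# Documentation-range addresses, not the ones from the question.
for line in lookup_all(["192.0.2.1", "198.51.100.2"], fake_fetch):
    print(line)
```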
I am trying to copy all the data within an HTML page that has the class "chapter_header_styling" with BS4.
This was working when I manually entered the URL, but that is tedious when there are multiple books and various chapters. So I created another script that generates all the chapter URLs for the book and combines them into a text file, bchap.txt (book chapters).
Since then I have altered the file and added various breakpoints, so ignore my lack of comments and the unused arrays/lists. I have narrowed the failure down to the ###Comment## section. It's probably not nested right, but I'm not sure. I had this working up to a point, but I can't figure out why it won't write the mydivs data into the BOOK.html file. If anyone with more experience could point me in the right direction, it would be much appreciated.
#mkbook.py
# coding: utf-8
from bs4 import BeautifulSoup
import requests
LINK = "https://codes.iccsafe.org/content/FAC2017"
pop = ""
#z = ""
chapters = open("bchap.txt",'r')
a = []
for aline in chapters:
    chap = aline
    #print (chap)
    #pop = ""
    pop = LINK+chap
    #print (pop)
    r = requests.get(pop)
    data = r.text
    #print(data)
    soup = BeautifulSoup(data, 'html.parser')
    mydivs = soup.findAll("div", {"class": ["annotator", "chapter_header_styling"]})
    f = open("BOOK.html","a")
    f.write("test <br/>")
    ########################################
    #MY PROBLEM IS BELOW NOT PRINTING DIV DATA INTO TXT FILE
    ########################################
    for div in mydivs:
        print (div)
        z = str(div)
        print(z) #doesn't printout...why???
        f.write(z)
    print len(mydivs)
    f.close()
chapters.close()
##############################################
## this is the old mkbook.py code before I looped it - inputing url 1 # time
#
# coding: utf-8
from bs4 import BeautifulSoup
import requests
r = requests.get("https://codes.iccsafe.org/content/FAC2017/preface")
data = r.text
soup = BeautifulSoup(data, 'html.parser')
a = []
mydivs = soup.findAll("div",{"class":["annotator",
"chapter_header_styling"]})
f = open("BOOK.html","a")
for div in mydivs:
    z = str(div)
    f.write(z)
f.close()
print len(mydivs) #outputs 1 if copied div data.
#######################################
#mkchap.py
# coding: utf-8
from bs4 import BeautifulSoup
import requests
r = requests.get("https://codes.iccsafe.org/content/FAC2017")
data = r.text
soup = BeautifulSoup(data, 'html.parser')
a = []
soup.findAll('option',{"value":True})
list = soup.findAll('option')
with open('bchap.txt', 'w') as filehandle:
    for l in list:
        filehandle.write(l['value'])
        filehandle.write("\n")
        print l['value']
#with open('bchap.txt', 'w') as filehandle:
#    filehandle.write("%s\n" % list)
filehandle.close()
The problem seems to be that you are constructing your URLs from the wrong base URL.
LINK = "https://codes.iccsafe.org/content/FAC2017"
If you take a look at your 1st request you can see this clearly.
print(pop)
print(r.status_code)
Outputs:
https://codes.iccsafe.org/content/FAC2017/content/FAC2017
404
After running the code to populate bchap.txt, its output is
/content/FAC2017
/content/FAC2017/legend
/content/FAC2017/copyright
/content/FAC2017/preface
/content/FAC2017/chapter-1-application-and-administration
/content/FAC2017/chapter-2-scoping-requirements
/content/FAC2017/chapter-3-building-blocks
/content/FAC2017/chapter-4-accessible-routes
/content/FAC2017/chapter-5-general-site-and-building-elements
/content/FAC2017/chapter-6-plumbing-elements-and-facilities
/content/FAC2017/chapter-7-communication-elements-and-features
/content/FAC2017/chapter-8-special-rooms-spaces-and-elements
/content/FAC2017/chapter-9-built-in-elements
/content/FAC2017/chapter-10-recreation-facilities
/content/FAC2017/list-of-figures
/content/FAC2017/fair-housing-accessibility-guidelines-design-guidelines-for-accessible-adaptable-dwellings
/content/FAC2017/advisory
Let's change the base URL first and try again.
from bs4 import BeautifulSoup
import requests
LINK = "https://codes.iccsafe.org"
pop = ""
chapters = open("bchap.txt",'r')
a = []
for aline in chapters:
    chap = aline
    pop = LINK+chap
    r = requests.get(pop)
    print(pop)
    print(r.status_code)
chapters.close()
Outputs:
https://codes.iccsafe.org/content/FAC2017
404
...
Why? Because of the trailing \n. If we do a
print(repr(pop))
It will output
'https://codes.iccsafe.org/content/FAC2017\n'
You'll have to strip away that \n also. The final code that worked is
from bs4 import BeautifulSoup
import requests
LINK = "https://codes.iccsafe.org"
pop = ""
chapters = open("bchap.txt",'r')
a = []
for aline in chapters:
    chap = aline
    pop = LINK+chap
    r = requests.get(pop.strip())
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')
    mydivs = soup.findAll("div", class_="annotator chapter_header_styling")
    f = open("BOOK.html","a")
    for div in mydivs:
        z = str(div)
        f.write(z)
    f.close()
chapters.close()
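Both fixes (the base URL and the trailing newline) can also be handled with the standard library's urllib.parse.urljoin. This is a minimal sketch, not the answer's method: chapter_urls is a hypothetical helper, and the sample list mimics a few lines of bchap.txt.

```python
from urllib.parse import urljoin

BASE = "https://codes.iccsafe.org"


def chapter_urls(lines, base=BASE):
    """Turn raw lines from a chapter list into absolute URLs."""
    urls = []
    for line in lines:
        path = line.strip()  # removes the "\n" left by file iteration
        if path:             # skip blank lines
            urls.append(urljoin(base, path))
    return urls


# Sample lines in the same shape as bchap.txt, newline included.
sample = ["/content/FAC2017\n", "/content/FAC2017/preface\n", "\n"]
for url in chapter_urls(sample):
    print(repr(url))
```

Because the stored paths start with "/", urljoin replaces the path part of the base, so the result is the same even if the base is given as https://codes.iccsafe.org/content/FAC2017.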
I am trying to save the generated list to a file. The printout of the list looks fine, but nothing is written to the Compoundlist.csv file. I am not sure what I am doing wrong; I have tried writing after the list is generated and also during the loop, with the same result.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import os
import csv
def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata
compoundlist = []
soup = make_soup("http://www.genome.jp/dbget-bin/www_bget?ko00020")
i = 1
file = open("Compoundlist.csv", "wb")
for record in soup.findAll("nobr"):
    compound = ''
    if (record.text[0] == "C" and record.text[1] == '0') or (record.text[0] == "C" and record.text[1] == '1'):
        compoundlist = "http://www.genome.jp/dbget-bin/www_bget?cpd:" + record.text
        file.write(compoundlist)
        print(compoundlist)
Try adding the following to the end of your code:
file.close()
This flushes the open file's buffer to disk.
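A with block closes the file automatically, so the flush cannot be forgotten. The sketch below also opens the file in text mode, because in Python 3 writing a str to a file opened with "wb" raises a TypeError. It rests on assumptions: write_compound_links is a hypothetical helper, and the IDs are illustrative values rather than scraped results.

```python
import csv
import os
import tempfile

PREFIX = "http://www.genome.jp/dbget-bin/www_bget?cpd:"


def write_compound_links(ids, path):
    """Write one compound URL per row for IDs starting with C0 or C1."""
    # Text mode with newline="" is what the csv module expects in Python 3.
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        for cid in ids:
            if cid.startswith(("C0", "C1")):
                writer.writerow([PREFIX + cid])
    # Leaving the with block closes the file and flushes its buffer.


# Illustrative IDs only; the real ones come from the "nobr" elements.
sample_ids = ["C00022", "K00001", "C15972"]
csv_path = os.path.join(tempfile.mkdtemp(), "Compoundlist.csv")
write_compound_links(sample_ids, csv_path)
with open(csv_path, newline="", encoding="utf-8") as fh:
    print(fh.read())
```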