Error while writing Beautifulsoup back over original HTML file - python

Could there be a problem with the encoding of the BeautifulSoup copy of my original HTML file?
I'm being told that I can't write to the file, because I must write a str, not None.
Please see the code and TypeError below:
#Manipulating HTML and saving changed with BeautifulSoup
#Importing libraries
from bs4 import BeautifulSoup
#Opening the local HTML file
site_html = open(r"C:\Users\rbaden\desktop\KPI_Site\index.html")
#Creating Soup from source HTML file
soup =BeautifulSoup(site_html)
#print(soup.prettify())
#Locate and view specified class in HTML file
test = soup.find_all(class_='test-message-one')
print(test)
#Test place holder for a python variable that should replace the specified class
var = ('Testing...456')
#Replace the class in soup rendition of HTML
for i in soup.find_all(class_='test-message-one'):
    i.string = var
#overwriting the source HTML file on local drive
with open(r"C:\Users\rbaden\desktop\KPI_Site\index.html") as f:
    f.write(soup.content)

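First, you need to open the file in "w" (write) mode; open() defaults to read mode.
Second, soup.content is not the document. BeautifulSoup treats an unknown attribute as a search for a tag with that name, so soup.content looks for a <content> tag and returns None. Write either str(soup) or soup.prettify() instead:
with open(r"C:\Users\rbaden\desktop\KPI_Site\index.html", "w") as f:
    f.write(soup.prettify())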
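For reference, here is a minimal sketch of the whole read, modify, write-back cycle. It runs against a throwaway file in a temporary directory rather than the real index.html; the helper name replace_text_in_file and the sample markup are made up for illustration, and html.parser is assumed as the parser.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def replace_text_in_file(path, css_class, new_text):
    """Set the text of every tag with css_class, then save the file in place."""
    html = Path(path).read_text(encoding="utf-8")
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(class_=css_class):
        tag.string = new_text
    # str(soup) keeps the existing layout; soup.prettify() would re-indent it.
    Path(path).write_text(str(soup), encoding="utf-8")
    return soup


# Demo on a temporary file (hypothetical content, not the asker's page)
demo_path = Path(tempfile.mkdtemp()) / "index.html"
demo_path.write_text('<p class="test-message-one">old</p>', encoding="utf-8")

demo_soup = replace_text_in_file(demo_path, "test-message-one", "Testing...456")
result = demo_path.read_text(encoding="utf-8")
print(result)

# soup.content is a lookup for a <content> tag, which does not exist here
print(demo_soup.content)
```

Writing str(soup) rather than soup.prettify() is a choice made here to leave the file's whitespace alone; either one is a str and can be passed to write().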

Related

How to scrape local file with BeautifulSoup

I'm learning Python and I'm following this online class lesson.
https://openclassrooms.com/fr/courses/7168871-apprenez-les-bases-du-langage-python/exercises/4173
At the end of the lesson, we're learning the ETL procedure.
Question 3:
I have to load an HTML script and use BeautifulSoup in a Python script.
Here is the problem: the only data mining I've done so far is with a website. I create a variable that contains the URL of the website, and after that I create a variable soup.
import requests
from bs4 import BeautifulSoup
url = 'https://www.gov.uk/search/news-and-communications'
reponse = requests.get(url)
page = reponse.content
soup = BeautifulSoup(page, 'html.parser')
This is easy because the HTML code is in a URL but how can I do that with a file inside my machine?
I create a new HTML file with the script inside (the file is named TestOC.html)
I create a new Python file.
from bs4 import BeautifulSoup
soup = BeautifulSoup('TestOC.html', 'html.parser')
But the file is not loaded. How can I do that?
BeautifulSoup takes the content (a string or an open file handle), not the file name. You can open the file yourself and read() it:
with open('TestOC.html') as f:
    content = f.read()
soup = BeautifulSoup(content, 'html.parser')
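To see the difference, the sketch below parses the file name as if it were markup and then parses the real file contents. The sample file is created in a temporary directory and its contents are invented for the demo; depending on the installed bs4 version, passing a file-like name may also trigger a warning, which is silenced here.

```python
import tempfile
import warnings
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical stand-in for TestOC.html
demo_path = Path(tempfile.mkdtemp()) / "TestOC.html"
demo_path.write_text("<html><body><h1>Hello</h1></body></html>", encoding="utf-8")

with warnings.catch_warnings():
    # Some bs4 versions warn that the markup looks like a file name.
    warnings.simplefilter("ignore")
    from_name = BeautifulSoup("TestOC.html", "html.parser")

# Passing the open file handle (or f.read()) parses the real document.
with open(demo_path, encoding="utf-8") as f:
    from_file = BeautifulSoup(f, "html.parser")

name_text = from_name.get_text()        # the file name itself, parsed as text
name_h1 = from_name.find("h1")          # nothing to find
file_h1 = from_file.find("h1").get_text()
print(repr(name_text), name_h1, file_h1)
```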

How to Extract a Division from Html with BeautifulSoup

I am trying to extract the 'meanings' section of a dictionary entry from a html file using beautifulsoup but it is giving me some trouble. Here is a summary of what I have tried so far:
I right click on the dictionary entry page below and save the webpage to my Python directory as 'aufmachen.html'
https://www.duden.de/rechtschreibung/aufmachen
Within the source code of this webpage, the section that I am trying to extract starts from line 1042 with the expression
I wrote the code below but neither tags nor Bedeutungen contains any search results.
import requests
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
with open("aufmachen.html",encoding="utf8") as f:
    doc = BeautifulSoup(f,"html.parser")
tags = doc.body.findAll(text = '<div class="division " id="bedeutungen">')
print(tags)
Bedeutungen = doc.body.findAll("div", {"id": "bedeutungen"})
print(Bedeutungen)
Could you please help me with this problem?
Thanks for your time in advance.
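BeautifulSoup accepts either a string or an open file handle, so passing the file object is not a bug in itself, as long as the parse happens inside the with block while the file is still open. Calling .read() on the file just makes the string explicit:
with open("aufmachen.html", "r", encoding="utf8") as f:
    doc = BeautifulSoup(f.read(), "html.parser")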
However it seems you want to pull in the HTML file from a URL, not a file on your computer. This can be done like this:
from bs4 import BeautifulSoup
import requests
url = "https://www.duden.de/rechtschreibung/aufmachen"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
Bedeutungen = soup.body.findAll("div", {"id": "bedeutungen"})
print(Bedeutungen)
Your first call to .findAll() didn't work because the text kwarg looks for text inside the tag, not a tag itself. The following works, but there's no particular reason to use this over the other shown above.
tags = soup.body.findAll("div", class_="division", id="bedeutungen")
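The difference between searching for text and searching for a tag can be checked on a small inline document. This is only a sketch: the markup below is a cut-down stand-in for the Duden page, and it uses string=, the newer name for the text= argument.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened version of the saved page
html = '<body><div class="division " id="bedeutungen"><h2>Bedeutungen</h2></div></body>'
soup = BeautifulSoup(html, "html.parser")

# string= matches text nodes, so searching for the tag's markup finds nothing
by_markup = soup.find_all(string='<div class="division " id="bedeutungen">')

# Searching by tag name and attribute finds the division itself
by_id = soup.find_all("div", id="bedeutungen")

# string= does match the text inside the <h2>
by_text = soup.find_all(string="Bedeutungen")

print(by_markup, len(by_id), by_text)
```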

Beautiful Soup and the find_all method not listing all tags in text file

I am trying to scrape a website that I put into a local html file. When I use the find_all() method I can get all the tags' text displayed on the python results. The problem is that I can't get it to display all the text in a .txt file.
from bs4 import BeautifulSoup
def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        for interest in interests:
            with open ('interest.txt', 'w') as file:
                file.write(f'{interest.text}')
                print(interest.text)
Python displays the text of all the tags, but when I write to the .txt file, only the last tag appears.
Edit: I would also like to do a similar thing with a docx file. I took Igor's suggested code and changed the relevant parts for a docx file, but I'm still having the same issue.
from bs4 import BeautifulSoup
import docx
def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        with open('interest.txt', 'w') as file:
            for interest in interests:
                mydoc = docx.Document()
                mydoc.add_paragraph(f'{interest.text}')
                mydoc.save("C:/Users\satam\PycharmProjects\pythonProject\Web Scraper\list.docx")
                print(interest.text)
You reopen the file in write mode in every iteration; this overwrites its previous contents. Either open it just once and place the loop within the with block, or open it with the a mode (a for "append"; open('interest.txt', 'a')).
(The former is likely preferable in this case as it seems there's no reason to keep opening and closing the file again and again while you're continuously writing to it.)
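The effect of the two approaches can be reproduced without any HTML at all. The sketch below uses a made-up list of headings and temporary files; it also adds a newline after each entry, which the question's code does not do.

```python
import tempfile
from pathlib import Path

headings = ["Music", "Hiking", "Chess"]  # hypothetical stand-ins for the <h2> texts
out_dir = Path(tempfile.mkdtemp())

# Reopening in "w" mode on every iteration truncates the file each time
overwritten = out_dir / "overwritten.txt"
for heading in headings:
    with open(overwritten, "w") as f:
        f.write(heading + "\n")

# Opening once and looping inside the with block keeps every line
opened_once = out_dir / "opened_once.txt"
with open(opened_once, "w") as f:
    for heading in headings:
        f.write(heading + "\n")

overwritten_text = overwritten.read_text()
opened_once_text = opened_once.read_text()
print(repr(overwritten_text))
print(repr(opened_once_text))
```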
Every iteration rewrites the interest.txt file.
You just need to take the with open... part out of the for loop.
Try replacing this fragment
for interest in interests:
    with open ('interest.txt', 'w') as file:
        file.write(f'{interest.text}')
        print(interest.text)
with the following code:
with open('interest.txt', 'w') as file:
    for interest in interests:
        file.write(f'{interest.text}')
        print(interest.text)
Here is the complete code:
from bs4 import BeautifulSoup
def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        with open('interest.txt', 'w') as file:
            for interest in interests:
                file.write(f'{interest.text}')
                print(interest.text)
Edit: Here is the .docx version for the updated question:
from bs4 import BeautifulSoup
import docx
def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        # Create the document once, before the loop
        mydoc = docx.Document()
        for interest in interests:
            mydoc.add_paragraph(f'{interest.text}')
            print(interest.text)
        # Save once, after every paragraph has been added (raw string keeps the backslashes literal)
        mydoc.save(r"C:/Users\satam\PycharmProjects\pythonProject\Web Scraper\list.docx")
N.B. the docx module can be installed with pip install python-docx.

Python only writing first line

I have a text file which I read in and then I extract the data I require and try sending it to a different new text file, but only the first line gets into the new text file.
import requests
url_file = open('url-test.txt','r')
out_file = open('url.NDJSON','w')
for url in url_file.readlines():
    html = requests.get(url).text
    out_file.writelines(html)
out_file.close()
Try:
for url in url_file.readlines():
    html = requests.get(url).text
    out_file.write(html)
or
lines = []
for url in url_file.readlines():
    html = requests.get(url).text
    # verify you are getting the expected data
    print(111111, html)
    lines.append(html)
out_file.writelines(lines)
Either write each html string inside the loop with write(), or append it to a list and call writelines() once after the loop.
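A self-contained sketch of the same loop is below. To keep it runnable offline, the download is passed in as a function; save_pages and fake_fetch are hypothetical names, and with requests you would pass something like lambda url: requests.get(url).text instead. Note that each line read from the file still ends with a newline, so the sketch strips it before using the line as a URL, and it writes one response per line.

```python
import tempfile
from pathlib import Path


def save_pages(url_path, out_path, fetch):
    """Fetch every URL listed in url_path and write each response on its own line."""
    with open(url_path) as url_file, open(out_path, "w") as out_file:
        for line in url_file:
            url = line.strip()  # drop the trailing newline kept by file iteration
            if not url:
                continue  # skip blank lines
            out_file.write(fetch(url) + "\n")
    # Both files are closed here, after the loop has finished


def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url).text."""
    return f'{{"url": "{url}"}}'


work = Path(tempfile.mkdtemp())
(work / "url-test.txt").write_text("https://example.com/a\nhttps://example.com/b\n")

save_pages(work / "url-test.txt", work / "url.NDJSON", fake_fetch)
saved = (work / "url.NDJSON").read_text().splitlines()
print(saved)
```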

How can I Extract Specific xml tags from a local xml file using python?

I'm pretty new to interacting with xml, python, and scraping data so bear with me please:
I've got an XML file with my notes saved from Evernote. I have been able to load BeautifulSoup and lxml into my Python environment. I have also been able to load the XML file and print it.
Here's my code up until the print:
from bs4 import BeautifulSoup
from xml.dom.minidom import parseString
file = open('myNotes.xml','r')
data = file.read()
dom = parseString(data)
print(dom.toxml())
I didn't include the actual printed file as it contains lots of base 64 code.
What I am trying to accomplish is to extract select xml tags and print them to a new file... help!
This is how to use BeautifulSoup to print XML (the 'xml' parser relies on lxml, which you already have installed):
from bs4 import BeautifulSoup
with open('myNotes.xml', 'r') as f:
    soup = BeautifulSoup(f, 'xml')
print(soup.prettify())
And to write it to a file:
with open("file.txt", "w") as f:
    f.write(soup.prettify())
Now, to extract all of a certain type of tag to a list:
# Extract all of the <a> tags:
tags = soup.find_all('a')
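If you would rather not depend on lxml, the same extract-and-write step can be sketched with the standard library's xml.etree.ElementTree instead of BeautifulSoup. The sample document, the tag names (note, title, content) and the output file name titles.txt are assumptions made up for this example, not taken from your export.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical, simplified notes file
sample = (
    "<en-export>"
    "<note><title>First</title><content>aaa</content></note>"
    "<note><title>Second</title><content>bbb</content></note>"
    "</en-export>"
)
work = Path(tempfile.mkdtemp())
src = work / "myNotes.xml"
src.write_text(sample, encoding="utf-8")

# Parse the file and collect the text of every <title> element
root = ET.parse(str(src)).getroot()
titles = [element.text for element in root.iter("title")]

# Write the selected tags' text to a new file, one per line
out = work / "titles.txt"
out.write_text("\n".join(titles) + "\n", encoding="utf-8")

written = out.read_text(encoding="utf-8")
print(written)
```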
