I have a folder of 200 academic journal articles saved as individual HTML files. I want to write a Python program where the user is asked to input the file name, and the file is then opened so it can be processed with Beautiful Soup. Can anyone help me do this?
In Python 2.x, this could be done as follows:
from bs4 import BeautifulSoup
filename = raw_input('Please enter filename: ')
with open(filename) as f_input:
    html = f_input.read()
soup = BeautifulSoup(html, "html.parser")
print soup
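In Python 3, the same approach needs only two changes, since raw_input() was renamed input() and print became a function:
from bs4 import BeautifulSoup
filename = input('Please enter filename: ')  # input() replaces Python 2's raw_input()
with open(filename) as f_input:
    html = f_input.read()
soup = BeautifulSoup(html, "html.parser")
print(soup)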
I'm learning Python and I'm following this online class lesson.
https://openclassrooms.com/fr/courses/7168871-apprenez-les-bases-du-langage-python/exercises/4173
At the end of the lesson, we're learning the ETL procedure.
Question 3:
I have to load an HTML script and use BeautifulSoup in a Python script.
Here's the problem: the only data mining I've done so far is with a website. I create a variable that contains the site's URL, and after that I create a soup variable:
import requests
from bs4 import BeautifulSoup
url = 'https://www.gov.uk/search/news-and-communications'
reponse = requests.get(url)
page = reponse.content
soup = BeautifulSoup(page, 'html.parser')
This is easy because the HTML comes from a URL, but how can I do the same with a file on my machine?
I create a new HTML file with the script inside (the file is named TestOC.html)
I create a new Python file.
from bs4 import BeautifulSoup
soup = BeautifulSoup('TestOC.html', 'html.parser')
But the file is not read. How can I do that?
BeautifulSoup takes the content, not the file name. You could open it yourself and read() it though:
with open('TestOC.html') as f:
    content = f.read()
soup = BeautifulSoup(content, 'html.parser')
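As a side note, BeautifulSoup also accepts an open file handle directly, so the explicit read() is optional:
from bs4 import BeautifulSoup
# Passing the file object itself; BeautifulSoup reads it internally
with open('TestOC.html') as f:
    soup = BeautifulSoup(f, 'html.parser')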
I want to open a txt file (which contains multiple links) and scrape the title of each page using BeautifulSoup.
My txt file contains links like this:
https://www.lipsum.com/7845284869/
https://www.lipsum.com/56677788/
https://www.lipsum.com/01127111236/
My code:
import requests as rq
from bs4 import BeautifulSoup as bs
with open('output1.csv', 'w', newline='') as f:
    url = open('urls.txt', 'r', encoding='utf8')
    request = rq.get(str(url))
    soup = bs(request.text, 'html.parser')
    title = soup.findAll('title')
    pdtitle = {}
    for pdtitle in title:
        pdtitle.append(pdtitle.text)
    f.write(f'{pdtitle}')
I want to open every link in the txt file and scrape its title. The main problem is that reading the txt file into the url variable is not working. How do I open the file and save the data to csv?
Your code isn't working because url holds the whole file object, not the individual URLs. You need to request them one by one:
import requests as rq
from bs4 import BeautifulSoup as bs

with open('urls.txt', 'r') as f:
    urls = f.readlines()

with open('output1.csv', 'w', newline='') as f:
    for url in urls:
        request = rq.get(url.strip())  # strip the trailing '\n' so the URL is valid
        soup = bs(request.text, 'html.parser')
        titles = soup.findAll('title')
        for title in titles:
            f.write(f'{title.text}\n')
Your urls may not be working because your urls are being read with a return line character: \n. You need to strip the text before putting them in a list.
Also, you are using .find_all('title'), and this will return a list, which is probably not what you are looking for. You probably just want the first title and that's it. In that case, .find('title') would be better. I have provided some possible corrections below.
from bs4 import BeautifulSoup
import requests

filepath = '...'

with open(filepath) as f:
    urls = [i.strip() for i in f.readlines()]

titles = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    title = soup.find('title')  # Note: will find the FIRST title only
    titles.append(title.text)   # Grabs the TEXT of the title only, removes HTML

new_csv = open('urls.csv', 'w')  # Make sure to prepend with desired location, e.g. 'C:/user/name/urls.csv'
for title in titles:
    new_csv.write(title + '\n')  # The '\n' ensures a new row is written
new_csv.close()
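Since the question asks about CSV specifically: if a title could contain a comma or quotation mark, the csv module handles the quoting for you. A minimal sketch reusing the titles list from above:
import csv

# csv.writer escapes commas and quotes inside titles automatically
with open('urls.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for title in titles:
        writer.writerow([title])  # one title per row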
I am trying to scrape a website that I saved into a local html file. When I use the find_all() method I can get all the tags' text displayed in the Python output. The problem is that I can't get all of that text written to a .txt file.
from bs4 import BeautifulSoup

def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        for interest in interests:
            with open('interest.txt', 'w') as file:
                file.write(f'{interest.text}')
                print(interest.text)
Python will print all the tags as text, but the .txt file only ever contains the last tag.
[Screenshot: output of the txt document]
Edit: I would also like to do a similar thing but with a .docx file. I took Igor's suggested code and changed the parts I needed for a .docx file, but I'm still having the same issue with the .docx file.
from bs4 import BeautifulSoup
import docx

def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        with open('interest.txt', 'w') as file:
            for interest in interests:
                mydoc = docx.Document()
                mydoc.add_paragraph(f'{interest.text}')
                mydoc.save("C:/Users\satam\PycharmProjects\pythonProject\Web Scraper\list.docx")
                print(interest.text)
You reopen the file in write mode in every iteration; this overwrites its previous contents. Either open it just once and place the loop within the with block, or open it with the a mode (a for "append"; open('interest.txt', 'a')).
(The former is likely preferable in this case as it seems there's no reason to keep opening and closing the file again and again while you're continuously writing to it.)
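For instance, the append variant would look like this (a minimal sketch; the loop is taken from the question):
# 'a' mode adds to the file on each reopen instead of truncating it
for interest in interests:
    with open('interest.txt', 'a') as file:
        file.write(f'{interest.text}\n')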
Every iteration rewrites the interest.txt file.
You just need to take the with open... part out of the for loop.
Try replacing this fragment
for interest in interests:
    with open('interest.txt', 'w') as file:
        file.write(f'{interest.text}')
        print(interest.text)
with the following code:
with open('interest.txt', 'w') as file:
    for interest in interests:
        file.write(f'{interest.text}\n')  # '\n' keeps each heading on its own line
        print(interest.text)
Here is the complete code:
from bs4 import BeautifulSoup

def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        with open('interest.txt', 'w') as file:
            for interest in interests:
                file.write(f'{interest.text}\n')
                print(interest.text)
Edit: Here is the .docx version for the updated question:
from bs4 import BeautifulSoup
import docx

def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        mydoc = docx.Document()  # create the document once, outside the loop
        for interest in interests:
            mydoc.add_paragraph(f'{interest.text}')
            print(interest.text)
        mydoc.save(r"C:\Users\satam\PycharmProjects\pythonProject\Web Scraper\list.docx")  # raw string avoids backslash escapes
N.B.: the docx module can be installed with pip install python-docx.
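For completeness, you would then call the function with the path to your HTML file; the filename here is only a placeholder:
# Hypothetical invocation; substitute your actual HTML file
interest_retrieval('page.html')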
Could there be a problem with the encoding of the BeautifulSoup copy of my original HTML file?
I'm being told that I can't write to the file because I'm passing None where a str is required.
Please see the code below; the TypeError is raised on the final write:
# Manipulating HTML and saving changes with BeautifulSoup

# Importing libraries
from bs4 import BeautifulSoup

# Opening the local HTML file
site_html = open(r"C:\Users\rbaden\desktop\KPI_Site\index.html")

# Creating soup from the source HTML file
soup = BeautifulSoup(site_html)
# print(soup.prettify())

# Locate and view the specified class in the HTML file
test = soup.find_all(class_='test-message-one')
print(test)

# Test placeholder for a Python variable that should replace the specified class
var = ('Testing...456')

# Replace the class in the soup rendition of the HTML
for i in soup.find_all(class_='test-message-one'):
    i.string = var

# Overwriting the source HTML file on the local drive
with open(r"C:\Users\rbaden\desktop\KPI_Site\index.html") as f:
    f.write(soup.content)
First, you need to open the file in w mode.
And, you need to either write str(soup) or soup.prettify():
with open(r"C:\Users\rbaden\desktop\KPI_Site\index.html", "w") as f:
f.write(soup.prettify())
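On the encoding concern in the question: open() falls back to a platform-dependent default encoding, so if the page contains non-ASCII characters it is safer to be explicit on both the read and the write. A minimal sketch, assuming the file is UTF-8:
from bs4 import BeautifulSoup

# Explicit encodings avoid platform-dependent defaults
with open(r"C:\Users\rbaden\desktop\KPI_Site\index.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

with open(r"C:\Users\rbaden\desktop\KPI_Site\index.html", "w", encoding="utf-8") as f:
    f.write(str(soup))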
I'm pretty new to interacting with XML, Python, and scraping data, so bear with me please:
I've got an XML file with my notes saved from Evernote. I have been able to load BeautifulSoup and lxml into my Python environment, and I have also been able to load the XML file and print it.
Here's my code up to the print:
from bs4 import BeautifulSoup
from xml.dom.minidom import parseString

file = open('myNotes.xml', 'r')
data = file.read()
dom = parseString(data)
print(dom.toxml())  # toxml() belongs to the parsed DOM, not the raw string
I didn't include the actual printed output as it contains lots of base64 data.
What I am trying to accomplish is to extract selected XML tags and print them to a new file... help!
Here is how to use BeautifulSoup to print XML:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('myNotes.xml', 'r'), 'xml')  # the 'xml' parser requires lxml, which you already have
print(soup.prettify())
And to write it to a file:
with open("file.txt", "w") as f:
f.write(soup.prettify())
Now, to extract all of a certain type of tag to a list:
# Extract all of the <a> tags:
tags = soup.find_all('a')
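And to print the selected tags to a new file, as the question asks, a minimal sketch (the output filename is arbitrary):
# Write the text of each extracted tag to a new file, one per line
with open('extracted.txt', 'w') as out:
    for tag in tags:
        out.write(tag.get_text() + '\n')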