Is it possible to pass BeautifulSoup HTML that I have copied to my clipboard, using pyperclip? I am having difficulty using requests because the page requires a login, and the usual methods for passing cookies to requests don't work.
The first approach using Pyperclip would be:
soup = BeautifulSoup(pyperclip.paste(), "html.parser")
You could also save the page locally with Ctrl+S and open the HTML file in Python:
with open(r'yourfile.html', "r", encoding='utf-8') as html_file:
    soup = BeautifulSoup(html_file.read(), "html.parser")
I'm learning Python and I'm following this online class lesson.
https://openclassrooms.com/fr/courses/7168871-apprenez-les-bases-du-langage-python/exercises/4173
At the end of the lesson, we're learning the ETL procedure.
Question 3:
I have to load an HTML file and parse it with BeautifulSoup in a Python script.
The problem is that the only data extraction I have done so far was from a website: I create a variable that contains the URL of the website, and then I create a soup variable.
import requests
from bs4 import BeautifulSoup
url = 'https://www.gov.uk/search/news-and-communications'
reponse = requests.get(url)
page = reponse.content
soup = BeautifulSoup(page, 'html.parser')
This is easy because the HTML code is at a URL, but how can I do the same with a file on my machine?
I created a new HTML file containing the script (the file is named TestOC.html).
I created a new Python file:
from bs4 import BeautifulSoup
soup = BeautifulSoup('TestOC.html', 'html.parser')
But the file is not read. How can I do that?
BeautifulSoup takes the content, not the file name. You could open it yourself and read() it though:
with open('TestOC.html') as f:
    content = f.read()
soup = BeautifulSoup(content, 'html.parser')
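To make the difference between passing a file name and passing content concrete, here is a minimal sketch. The sample markup and the temporary directory are assumptions made only so the example is self-contained; the file name TestOC.html is taken from the question. It also shows that an open file object can be passed directly instead of calling read().

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample page, written to a temporary file so the sketch is self-contained.
html = "<html><head><title>Test page</title></head><body><p>Hello</p></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "TestOC.html")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(html)

    # Passing the file name: the string "TestOC.html" itself is parsed as markup.
    wrong = BeautifulSoup("TestOC.html", "html.parser")

    # Passing an open file object: BeautifulSoup reads the content from it.
    with open(file_path, encoding="utf-8") as f:
        right = BeautifulSoup(f, "html.parser")

print(wrong.title)         # there is no <title> tag in the text "TestOC.html"
print(right.title.string)  # the title read from the file
```

Recent versions of BeautifulSoup may also print a warning for the first call, because the string looks like a file name rather than markup.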
I am trying to download PDF files from this website.
I am new to Python and am currently learning about the software. I have downloaded packages such as urllib and bs4. However, there is no .pdf extension in any of the URLs. Instead, each one has the following format: http://www.smv.gob.pe/ConsultasP8/documento.aspx?vidDoc={.....}.
I have tried to use the soup.find_all command. However, this was not successful.
from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib
url="http://www.smv.gob.pe/frm_hechosdeImportanciaDia?data=38C2EC33FA106691BB5B5039DACFDF50795D8EC3AF"
response = request.urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(http://www.smv.gob.pe/ConsultasP8/documento.aspx?)'))
print(links)
This works for me:
import re
import requests
from bs4 import BeautifulSoup
url = "http://www.smv.gob.pe/frm_hechosdeImportanciaDia?data=38C2EC33FA106691BB5B5039DACFDF50795D8EC3AF"
response = requests.get(url).content
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(http://www.smv.gob.pe/ConsultasP8/documento.aspx?)'))
links = [l['href'] for l in links]
print(links)
The only difference is that I use requests because I'm used to it, and I take the href attribute of each Tag returned by BeautifulSoup.
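A side note on the pattern itself: in a regular expression, "." matches any character and "x?" makes the "x" optional, so the pattern above is looser than it looks. The sketch below compares it with an escaped version using only the standard library. The hrefs in the list are hypothetical placeholders for illustration, not real document links.

```python
import re

base = "http://www.smv.gob.pe/ConsultasP8/documento.aspx?"

# Unescaped: "." matches any character and "x?" makes the "x" optional,
# so the literal "?" is never required.
loose = re.compile(base)

# re.escape() turns the whole prefix into a literal match.
strict = re.compile(re.escape(base))

# Hypothetical hrefs, for illustration only.
hrefs = [
    "http://www.smv.gob.pe/ConsultasP8/documento.aspx?vidDoc={EXAMPLE-ID}",
    "http://www.smv.gob.pe/ConsultasP8/documento.aspx",
    "http://www.smv.gob.pe/frm_hechosdeImportanciaDia",
]

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(len(loose_hits), len(strict_hits))
```

Both patterns find the document links on a page like the one above; the escaped one simply avoids accidental matches on similar-looking URLs.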
I want to get a fragment from an HTML page with Python.
For example, from the URL http://steven-universe.wikia.com/wiki/Steven_Universe_Wiki I want to get the text in the box "next Episode" as a string. How can I get it?
First of all, install the latest versions of BeautifulSoup and requests (for example with pip install beautifulsoup4 requests).
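from bs4 import BeautifulSoup
import requests
url = "http://steven-universe.wikia.com/wiki/Steven_Universe_Wiki"
con = requests.get(url).content
soup = BeautifulSoup(con, "html.parser")
# find() returns a single Tag (or None); the list returned by find_all() has no .text
link = soup.find("a", href="/wiki/Gem_Harvest")
text = link.text if link is not None else None
print(text)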
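The difference between find() and find_all() matters when reading text from a match: the first gives back one Tag (or None), the second a list of Tags. A small sketch on hypothetical markup, standing in for the downloaded page so that no network access is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html = """
<div class="box">
  <a href="/wiki/Gem_Harvest">Gem Harvest</a>
  <a href="/wiki/Other_Page">Other page</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag, or None when nothing matches.
first = soup.find("a", href="/wiki/Gem_Harvest")
first_text = first.get_text() if first is not None else None

# find_all() returns a list-like ResultSet; read the text from each Tag in it.
all_texts = [a.get_text() for a in soup.find_all("a")]

print(first_text)
print(all_texts)
```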
What I have tried so far:
1)
response = urllib2.urlopen(url)
html = response.read()
This way, I can't open the URL in the browser.
2)
webbrowser.open(url)
This way, I can't get the source code of the URL.
So, how can I open a URL and get its source code at the same time?
Thanks for your help.
Have a look at BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
You can request a website and then read the HTML source code from it:
import requests
from bs4 import BeautifulSoup
r = requests.get(YourURL)  # YourURL is a placeholder for the page address
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
If the page builds its content with JavaScript, requests only returns the initial HTML; in that case, look into headless browsers.
I'm trying to scrape a webpage with BeautifulSoup using the code below:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen("http://en.wikipedia.org//wiki//Markov_chain.htm") as url:
    s = url.read()
soup = BeautifulSoup(s)
with open("scraped.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())
    f.close()
The problem is that it saves Wikipedia's main page instead of that specific article. Why doesn't the address work, and how should I change it?
The correct url for the page is http://en.wikipedia.org/wiki/Markov_chain:
>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> url = "http://en.wikipedia.org/wiki/Markov_chain"
>>> soup = BeautifulSoup(urllib.request.urlopen(url))
>>> soup.title
<title>Markov chain - Wikipedia, the free encyclopedia</title>
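One way to see how the two addresses differ is to compare their paths. This is only a small sketch using the standard library's urllib.parse; it makes no request and does not show how the server handles either path.

```python
from urllib.parse import urlsplit

bad = "http://en.wikipedia.org//wiki//Markov_chain.htm"
good = "http://en.wikipedia.org/wiki/Markov_chain"

# The path is everything after the host name.
bad_path = urlsplit(bad).path
good_path = urlsplit(good).path

print(bad_path)   # doubled slashes and a ".htm" suffix
print(good_path)  # single slashes, no suffix
```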
@alecxe's answer will generate:
**GuessedAtParserWarning**:
No parser was explicitly specified, so I'm using the best
available HTML parser for this system ("html.parser"). This usually isn't a problem,
but if you run this code on another system, or in a different virtual environment, it
may use a different parser and behave differently. The code that caused this warning
is on line 25 of the file crawl.py.
To get rid of this warning, pass the additional argument 'features="html.parser"' to
the BeautifulSoup constructor.
Here is a solution without GuessedAtParserWarning using requests:
# crawl.py
from os import path
import requests
from bs4 import BeautifulSoup
url = 'https://www.sap.com/belgique/index.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
file = path.join(path.dirname(__file__), 'downl.txt')
# Either print the title/text or save it to a file
print(soup.title)
# Save the page text to a file
with open(file, 'w', encoding='utf-8') as f:
    f.write(soup.text)