Using pyperclip with BeautifulSoup

Is it possible to pass BeautifulSoup HTML that I have copied to my clipboard, using pyperclip? I am having difficulty using requests because the page requires a login, and the usual methods for passing cookies to requests don't work.

Yes. The first approach, using pyperclip, would be:
import pyperclip
from bs4 import BeautifulSoup
soup = BeautifulSoup(pyperclip.paste(), "html.parser")
You could also save the page locally with Ctrl+S and load the HTML file in Python:
with open(r'yourfile.html', "r", encoding='utf-8') as html_file:
    soup = BeautifulSoup(html_file.read(), "html.parser")
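One thing to watch with the clipboard route: selecting a rendered page and pressing Ctrl+C usually copies plain text, so the markup has to be copied from the browser's "View source" or developer tools instead. The sketch below is one possible way to guard against an empty clipboard before parsing. The helper name soup_from_text and the sample markup are made up for illustration; in real use the string would come from pyperclip.paste().

```python
from bs4 import BeautifulSoup


def soup_from_text(text):
    """Parse a string of HTML, refusing empty or whitespace-only input."""
    if not text or not text.strip():
        raise ValueError("no HTML to parse (is the clipboard empty?)")
    return BeautifulSoup(text, "html.parser")


# Hypothetical stand-in for pyperclip.paste(); replace with the real call.
sample = (
    "<html><head><title>Dashboard</title></head>"
    "<body><p class='user'>alice</p></body></html>"
)

soup = soup_from_text(sample)
print(soup.title.string)
print(soup.find("p", class_="user").get_text())
```

Keeping the parsing step separate from the clipboard call also makes it easy to test the scraping logic against a saved HTML file.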

Related

How to scrape local file with BeautifulSoup

I'm learning Python and I'm following this online class lesson.
https://openclassrooms.com/fr/courses/7168871-apprenez-les-bases-du-langage-python/exercises/4173
At the end of the lesson, we learn about the ETL procedure.
Question 3:
I have to load an HTML file and parse it with BeautifulSoup in a Python script.
The problem is this: the only scraping I have done so far is on a live website, where I create a variable containing the site's URL and then create a soup variable from the response.
import requests
from bs4 import BeautifulSoup
url = 'https://www.gov.uk/search/news-and-communications'
reponse = requests.get(url)
page = reponse.content
soup = BeautifulSoup(page, 'html.parser')
This is easy because the HTML is at a URL, but how can I do the same with a file on my machine?
I created a new HTML file containing the markup (the file is named TestOC.html).
I created a new Python file:
from bs4 import BeautifulSoup
soup = BeautifulSoup('TestOC.html', 'html.parser')
But the file is not read. How can I do that?

BeautifulSoup takes the content, not the file name. You can open the file yourself and read() it:
from bs4 import BeautifulSoup
with open('TestOC.html') as f:
    content = f.read()
soup = BeautifulSoup(content, 'html.parser')
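To make the difference concrete, here is a small sketch that writes a throwaway HTML file and parses it both ways. The file contents are invented for the demonstration, and the temporary directory is only there so the example runs anywhere. It also shows that BeautifulSoup accepts an open file handle directly, as an alternative to calling read().

```python
import tempfile
import warnings
from pathlib import Path

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file standing in for the asker's TestOC.html.
    html_path = Path(tmp) / "TestOC.html"
    html_path.write_text("<html><body><h1>Hello</h1></body></html>", encoding="utf-8")

    # Passing the file NAME: the string itself is parsed as markup.
    # Recent bs4 versions may warn about this, so warnings are silenced here.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        wrong = BeautifulSoup("TestOC.html", "html.parser")

    # Passing an open file handle: the file's content is parsed.
    with open(html_path, encoding="utf-8") as f:
        right = BeautifulSoup(f, "html.parser")

print(wrong.find("h1"))      # no <h1>, only the text "TestOC.html"
print(right.h1.get_text())
```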

Download pdf files without .pdf url

I am trying to download PDF files from this website.
I am new to Python and am currently learning about the software. I have downloaded packages such as urllib and bs4. However, there is no .pdf extension in any of the URLs. Instead, each one has the following format: http://www.smv.gob.pe/ConsultasP8/documento.aspx?vidDoc={.....}.
I have tried to use the soup.find_all command. However, this was not successful.
from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib
url="http://www.smv.gob.pe/frm_hechosdeImportanciaDia?data=38C2EC33FA106691BB5B5039DACFDF50795D8EC3AF"
response = request.urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(http://www.smv.gob.pe/ConsultasP8/documento.aspx?)'))
print(links)

This works for me:
import re
import requests
from bs4 import BeautifulSoup
url = "http://www.smv.gob.pe/frm_hechosdeImportanciaDia?data=38C2EC33FA106691BB5B5039DACFDF50795D8EC3AF"
response = requests.get(url).content
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(http://www.smv.gob.pe/ConsultasP8/documento.aspx?)'))
links = [link['href'] for link in links]
print(links)
The only difference is that I use requests because I'm used to it, and I take the href attribute of each Tag returned by BeautifulSoup.
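Collecting the links is only half of the job, so here is a rough sketch of how the download step might look. It assumes the server answers each documento.aspx URL with the PDF bytes themselves, which has not been verified here. The sample markup, the vidDoc value, and the helper names pdf_name and download are all hypothetical. The pattern is built with re.escape so that "." and "?" are matched literally.

```python
import os
import re
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Made-up markup imitating the link format described in the question.
html = """
<a href="http://www.smv.gob.pe/ConsultasP8/documento.aspx?vidDoc={EXAMPLE-ID-1}">Doc</a>
<a href="http://www.smv.gob.pe/other.aspx">Something else</a>
"""

prefix = "http://www.smv.gob.pe/ConsultasP8/documento.aspx?"
pattern = re.compile(re.escape(prefix))

soup = BeautifulSoup(html, "html.parser")
hrefs = [a["href"] for a in soup.find_all("a", href=pattern)]


def pdf_name(href):
    """Build a file name from the vidDoc query parameter."""
    doc_id = parse_qs(urlparse(href).query)["vidDoc"][0]
    return doc_id.strip("{}") + ".pdf"


def download(href, folder="."):
    """Save the response body to disk (assumes it is a PDF)."""
    target = os.path.join(folder, pdf_name(href))
    with urlopen(href) as response, open(target, "wb") as out:
        out.write(response.read())
    return target


print(hrefs)
print([pdf_name(h) for h in hrefs])
```

Calling download(h) for each entry in hrefs would then write one file per link; checking the response's Content-Type header first would be a sensible extra safeguard.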

How get a fragment of the code from External HTML (web site) with python

I want to get a fragment from a HTML website with python.
For example from the url http://steven-universe.wikia.com/wiki/Steven_Universe_Wiki I want to get the text in the box "next Episode", as a string. How can I get it?

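First, install the latest versions of BeautifulSoup (the beautifulsoup4 package) and requests, for example with pip.
from bs4 import BeautifulSoup
import requests
url = "http://steven-universe.wikia.com/wiki/Steven_Universe_Wiki"
con = requests.get(url).content
soup = BeautifulSoup(con, "html.parser")
# find() returns a single Tag (or None); find_all() returns a list, which has no .text
link = soup.find("a", href="/wiki/Gem_Harvest")
print(link.text)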
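A detail that is easy to trip over: find_all() returns a list-like ResultSet, so text has to be read from each element, while find() returns a single Tag or None. The following sketch shows both on an invented snippet; the markup is only a guess at what the "next Episode" box could look like, not the wiki's actual HTML.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the downloaded page.
html = (
    '<div class="box"><h2>Next Episode</h2>'
    '<a href="/wiki/Gem_Harvest"> Gem Harvest </a></div>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all(): iterate over the matches and read the text of each one.
texts = [a.get_text(strip=True) for a in soup.find_all("a", href="/wiki/Gem_Harvest")]

# find(): a single Tag, or None when nothing matches.
first = soup.find("a", href="/wiki/Gem_Harvest")
title = first.get_text(strip=True) if first is not None else None

missing = soup.find("a", href="/wiki/Does_Not_Exist")

print(texts)
print(title)
print(missing)
```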

Python How to open an URL and get source code at the same time?

What I have tried so far:
1)
response = urllib2.urlopen(url)
html = response.read()
This way, I can't open the URL in a browser.
2)
webbrowser.open(url)
This way, I can't get the source code of the URL.
So, how can I open a URL and get its source code at the same time?
Thanks for your help.

Have a look at BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
You can request a page and then read its HTML source:
import requests
from bs4 import BeautifulSoup
r = requests.get(YourURL)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
If the page builds its content with JavaScript, look into headless browsers.
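If the goal is literally to do both things, one simple option is to call webbrowser.open() and fetch the source in the same function. This is only a sketch: the function names are made up, and the browser and the script make two independent requests, so what the script receives may differ from what the browser displays (for instance on pages that need a login). The opener and fetcher are parameters so the function can be tried without a network connection.

```python
import webbrowser
from urllib.request import urlopen


def fetch_source(url):
    """Download the raw bytes served at the URL."""
    with urlopen(url) as response:
        return response.read()


def open_and_fetch(url, open_in_browser=webbrowser.open, fetch=fetch_source):
    """Open the URL in the default browser and return its source."""
    open_in_browser(url)
    return fetch(url)


# Offline demonstration with stand-ins for the browser and the download.
opened = []
source = open_and_fetch(
    "http://example.com",
    open_in_browser=opened.append,
    fetch=lambda u: b"<html><body>stub</body></html>",
)
print(opened)
print(source)
```

In normal use, open_and_fetch(url) with the default arguments opens the page and returns the bytes, which can then be handed to BeautifulSoup.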

Saving content of a webpage using BeautifulSoup

I'm trying to scrape a webpage with BeautifulSoup using the code below:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen("http://en.wikipedia.org//wiki//Markov_chain.htm") as url:
    s = url.read()
soup = BeautifulSoup(s)
with open("scraped.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())
    f.close()
The problem is that it saves Wikipedia's main page instead of that specific article. Why doesn't the address work, and how should I change it?

The correct URL for the page is http://en.wikipedia.org/wiki/Markov_chain:
>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> url = "http://en.wikipedia.org/wiki/Markov_chain"
>>> soup = BeautifulSoup(urllib.request.urlopen(url))
>>> soup.title
<title>Markov chain - Wikipedia, the free encyclopedia</title>

@alecxe's answer will generate:
GuessedAtParserWarning:
No parser was explicitly specified, so I'm using the best
available HTML parser for this system ("html.parser"). This usually isn't a problem,
but if you run this code on another system, or in a different virtual environment, it
may use a different parser and behave differently. The code that caused this warning
is on line 25 of the file crawl.py.
To get rid of this warning, pass the additional argument 'features="html.parser"' to
the BeautifulSoup constructor.
Here is a solution without GuessedAtParserWarning, using requests:
# crawl.py
from os import path
import requests
from bs4 import BeautifulSoup
url = 'https://www.sap.com/belgique/index.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
file = path.join(path.dirname(__file__), 'downl.txt')
# Either print the title/text or save it to a file
print(soup.title)
# Save the text to a file
with open(file, 'w', encoding='utf-8') as f:
    f.write(soup.text)
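When saving extracted text, two small choices tend to matter: passing an explicit encoding to the file (the platform default is not always UTF-8) and telling get_text() how to separate elements so paragraphs do not run together. Here is a minimal sketch on an invented page, writing to a temporary directory so it has no side effects.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Made-up page with a non-ASCII character to exercise the encoding.
html = (
    "<html><head><title>Markov chain</title></head>"
    "<body><p>Café</p><p>states</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# One line per text node, with surrounding whitespace stripped.
text = soup.body.get_text(separator="\n", strip=True)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "scraped.txt"
    out_path.write_text(text, encoding="utf-8")
    saved = out_path.read_text(encoding="utf-8")

print(soup.title.string)
print(saved)
```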
