I'm trying to remove all the HTML/JavaScript from a page using bs4; however, it doesn't get rid of the JavaScript. I still see it mixed in with the text. How can I get around this?
I tried using nltk, which works fine; however, clean_html and clean_url will be removed going forward. Is there a way to use soup's get_text and get the same result?
I tried looking at these other pages:
BeautifulSoup get_text does not strip all tags and JavaScript
Currently I'm using nltk's deprecated functions.
EDIT
Here's an example:
import urllib
from bs4 import BeautifulSoup
url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print soup.get_text()
I still see the following for CNN:
$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});
/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});
How can I remove the JS?
The only other option I found is:
https://github.com/aaronsw/html2text
The problem with html2text is that it's really slow at times and creates noticeable lag, which is one thing nltk was always very good about.
Based partly on Can I remove script tags with BeautifulSoup?
import urllib
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
To prevent encoding errors at the end...
import urllib
from bs4 import BeautifulSoup

url = url
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text.encode('utf-8'))
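Note that urllib.urlopen only exists in Python 2. For anyone on Python 3, here is a minimal sketch of the same approach (standard library only, apart from bs4):
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements, then collapse the text exactly as above
for script in soup(["script", "style"]):
    script.extract()

text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
print('\n'.join(chunk for chunk in chunks if chunk))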
Related
I made a web scraper to get the informative text of a Wikipedia page. I get the text I want, but I need to cut off a big part at the bottom. I already tried some other solutions, but with those I don't get the headers and whitespace I need.
import requests
from bs4 import BeautifulSoup
import re
website = "https://nl.wikipedia.org/wiki/Kat_(dier)"
request = requests.get(website)
soup = BeautifulSoup(request.text, "html.parser")
text = list()
text.extend(soup.findAll('mw-content-text'))
text_content = soup.text
text_content = re.sub(r'==.*?==+', '', text_content)
# text_content = text.replace('\n', '')
print(text_content)
Here, soup.text is all the text of the wikipedia page with the class='mw-content-text' printed as a string. This prints the overall text I need but I need to cut off the string where it starts showing the text of the sources. I already tried the replace method but it didn't do anything.
Given this page, I want to cut off what's under the red line in the big string of text I have scraped.
I tried something like this, which didn't work:
for content in soup('span', {'class': 'mw-content-text'}):
    print(content.text)
    text = content.findAll('p', 'a')
    for t in text:
        print(text.text)
I also tried this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import requests
website = urlopen("https://nl.wikipedia.org/wiki/Kat_(dier)").read()
soup = BeautifulSoup(website, 'lxml')

text = ''
for content in soup.find_all('p'):
    text += content.text

text = re.sub(r'\[.*?\]+', '', text)
text = text.replace('\n', '')
# print(text)
but these approaches just gave me an unreadable mess of text. I still want the whitespaces and headers that my base code gives me.
It is still a bit abstract, but you can reach your goal by iterating over all the children and breaking as soon as a tag with the class appendix appears:
for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    if c.get('class') and 'appendix' in c.get('class'):
        break
    print(c.get_text(strip=True))
Example
import requests
from bs4 import BeautifulSoup

website = "https://nl.wikipedia.org/wiki/Kat_(dier)"
request = requests.get(website)
soup = BeautifulSoup(request.text)

for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    if c.get('class') and 'appendix' in c.get('class'):
        break
    print(c.get_text(strip=True))
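If you want the result as one string instead of printing each child (so you keep the headers and blank lines the question asks for), a small variation on the same loop:
parts = []
for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    if c.get('class') and 'appendix' in c.get('class'):
        break
    parts.append(c.get_text())  # keep each child's own line breaks
article_text = '\n'.join(parts)
print(article_text)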
There is likely a more efficient solution but here is a list comprehension that solves your issue:
# the rest of your code
references = [line for line in text_content.split('\n') if line.startswith("β")]
Here's an alternative version that might be easier to understand:
# the rest of your code

# Turn text_content into a list of lines
text_content = text_content.split('\n')

references = []

# Iterate through each line and only save the values that start
# with the symbol used for each reference, on wikipedia: "β"
# ( or "^" for english wikipedia pages )
for line in text_content:
    if line.startswith("β"):
        references.append(line)
Both scripts will do the same thing.
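If the goal is to drop those reference lines from the output rather than collect them, the same startswith check can simply be inverted; a minimal sketch, assuming text_content is still the full string from the question:
# keep every line that does not start with the reference marker
cleaned = '\n'.join(line for line in text_content.split('\n')
                    if not line.startswith("β"))
print(cleaned)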
I have the following script and I would like to retrieve the URLs from a text file rather than from a list in the code. I'm new to Python and keep getting stuck!
from bs4 import BeautifulSoup
import requests

urls = ['URL1',
        'URL2',
        'URL3']

for u in urls:
    response = requests.get(u)
    data = response.text
    soup = BeautifulSoup(data, 'lxml')
Could you please be a little more clear about what you want?
Here is a possible answer which might or might not be what you want:
from bs4 import BeautifulSoup
import requests

with open('yourfilename.txt', 'r') as url_file:
    for line in url_file:
        u = line.strip()
        response = requests.get(u)
        data = response.text
        soup = BeautifulSoup(data, 'lxml')
The file was opened with the open() function; the second argument is 'r' to specify we're opening it in read-only mode. The call to open() is encapsulated in a with block so the file is automatically closed as soon as you no longer need it open.
The strip() function removes leading and trailing whitespace (spaces, tabs, newlines) from every line; for instance, '  https://stackoverflow.com  '.strip() becomes 'https://stackoverflow.com'.
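If the file might contain blank lines, a slight variation (a sketch assuming the same yourfilename.txt) reads everything into a list first and skips the empty entries:
import requests
from bs4 import BeautifulSoup

with open('yourfilename.txt', 'r') as url_file:
    urls = [line.strip() for line in url_file if line.strip()]  # drop empty lines

for u in urls:
    response = requests.get(u)
    soup = BeautifulSoup(response.text, 'lxml')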
I am using the code below to scrape a web page and kill the script and style elements so that I only get the text from the page.
link= "https://en.wikipedia.org/wiki/Mark_Zuckerberg"
url = Request(link,headers={'User-Agent': 'Chrome/5.0'})
html = urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Example: Suppose the soup from website is
<ul><li>Technology entrepreneur</li><li>philanthropist</li></ul></div></td>
</tr><tr><th scope="row">Years active</th><td>
I want it to print
Technology entrepreneur philanthropist Years active
whereas it is printing
Technology entrepreneurphilanthropistYears active
I want it to insert a space wherever it strips out an element. Any suggestions on the above code are appreciated. You can run the original URL to check.
After you extract the script tags, you could convert the HTML back to a string and use a regex to replace the remaining tags with spaces.
This works for me:
import requests
from bs4 import BeautifulSoup
import re

link = "https://en.wikipedia.org/wiki/Mark_Zuckerberg"
r = requests.get(link, headers={'User-Agent': 'Chrome/5.0'})
html = r.text
soup = BeautifulSoup(html, "lxml")  # feel free to use other parsers, e.g. html.parser; I use lxml as it's the fastest one...

for script in soup.find_all('script'):
    script.extract()

html = str(soup)
html = re.sub('<.+?>', ' ', html)
html = " ".join(html.strip().split())
print(html)
Edited after it became clear to me what was really asked for...
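For completeness, get_text() itself accepts a separator argument, so you can often skip the regex step; a short sketch of that variant on the same page:
import requests
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/Mark_Zuckerberg"
soup = BeautifulSoup(requests.get(link, headers={'User-Agent': 'Chrome/5.0'}).text, "lxml")
for script in soup(["script", "style"]):
    script.extract()

# separator=" " puts a space between adjacent text nodes, strip=True trims each piece
print(soup.get_text(" ", strip=True))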
I would like to use Python to get some data under a pre tag from an HTML page.
The HTML looks like this.
I tried to use Selenium first, but it fails to find the element by XPath.
browser = webdriver.Ie()
wait = WebDriverWait(browser, 5)
browser.get('file:\\\my_url.html')
body= wait.until(EC.presence_of_element_located((By.XPATH, "/html/body/pre[2]")))
print(body.text)
I also tried bs4. However, BeautifulSoup keeps telling me that my browser does not support the Frames extension. I am not familiar with bs4 and cannot find any useful solution. Can anyone tell me how to modify the IE browser settings to successfully read the data? Thanks!
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen
import html2text

url = " "  # this html page is on a network drive and can be opened by IE\Chrome\...
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

for script in soup(["script", "style"]):
    script.extract()    # rip it out

text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
>>>This page is designed to be viewed by a browser which supports Frames extension.
This text will be shown by browsers which do not support the Frames extension.
Your pre element is inside a <frame> with the name "glhstry_main", so you need to switch to it first, before accessing your element. Here:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Ie()
wait = WebDriverWait(browser, 5)
browser.get('file:\\\my_url.html')
browser.switch_to.frame("glhstry_main")  # switching to the frame
body = wait.until(EC.presence_of_element_located((By.XPATH, "/html/body/pre[2]")))
print(body.text)
# do your frame stuff
browser.switch_to.default_content()  # switching back to the original HTML from the frame
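If you would rather do the parsing with BeautifulSoup, one option is to switch into the frame with Selenium and hand the frame's HTML to bs4; a minimal sketch, assuming the same frame name and that the pre you want is the second one:
from bs4 import BeautifulSoup

browser.switch_to.frame("glhstry_main")
frame_soup = BeautifulSoup(browser.page_source, "html.parser")  # page_source now returns the frame's document
pre_tags = frame_soup.find_all("pre")
if len(pre_tags) >= 2:
    print(pre_tags[1].get_text())  # the second <pre>, matching /html/body/pre[2]
browser.switch_to.default_content()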
from urllib.request import urlopen
from bs4 import BeautifulSoup

# specify the url
wiki = "http://www.bbc.com/urdu"

# Query the website and return the html to the variable 'page'
page = urlopen(wiki)

# Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "html.parser")

all_links = soup.find_all("a")
for link in all_links:
    pass  # nothing is done with the individual links yet
    #print (link.get("href"))
    #text=soup.body.get_text()
    #print(text)

for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.body.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)

text1 = str(text)
text_file = open("C:\\Output.txt", 'w', encoding="utf-8")  # utf-8 so non-Latin (Urdu) text can be written
text_file.write(text)
text_file.close()
I want to extract data from a news website using Beautiful Soup. I wrote some code, but it is not giving me the required output. First, I have to process all the links on a page, then extract data from each and save it to a file. Then I move on to the next page, extract data, save it, and so on. Right now, I was just trying to process the links on the first page, but it is not giving me the full text, and it is also giving me some tags in the output.
To extract all links from a website you can try something like this:
data = []
soup = BeautifulSoup(page, "html.parser")

for link in soup.find_all('a', href=True):
    data.append(link['href'])

text = '\n'.join(data)
print(text)
And then proceed to save text to a file. After this, you need to iterate over data and fetch each of those URLs as well, extracting the text from those pages in the same way.
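A minimal sketch of that second step, reusing the script/style stripping shown earlier in the thread (the requests usage and the articles.txt output path are assumptions):
import requests
from bs4 import BeautifulSoup

for href in data:
    if not href.startswith("http"):   # skip relative links and anchors in this sketch
        continue
    page_html = requests.get(href).text
    article = BeautifulSoup(page_html, "html.parser")
    for script in article(["script", "style"]):
        script.extract()
    article_text = article.get_text()
    with open("articles.txt", "a", encoding="utf-8") as out_file:
        out_file.write(article_text + "\n")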