Killing Script and Style Elements During Web Scraping in Python

I am using the code below to scrape a web page and remove script and style elements so that I only get the text of the page:
link= "https://en.wikipedia.org/wiki/Mark_Zuckerberg"
url = Request(link,headers={'User-Agent': 'Chrome/5.0'})
html = urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Example: suppose the soup from the website contains
<ul><li>Technology entrepreneur</li><li>philanthropist</li></ul></div></td>
</tr><tr><th scope="row">Years active</th><td>
I want it to print
Technology entrepreneur philanthropist Years active
whereas it is printing
Technology entrepreneurphilanthropistYears active
I want it to insert a space wherever it removes elements, so that adjacent pieces of text do not get glued together. Any suggestions on the above code are appreciated. You can run the original URL to check.

After you extract the script tags, you could convert the HTML back to a string and use a regex to substitute the remaining tags with spaces.
This works for me:
import requests
from bs4 import BeautifulSoup
import re
link= "https://en.wikipedia.org/wiki/Mark_Zuckerberg"
r = requests.get(link, headers={'User-Agent': 'Chrome/5.0'})
html = r.text
soup = BeautifulSoup(html, "lxml") # feel free to use other parsers, e.g. html.parser, I use lxml as it's the fastest one...
for script in soup.find_all('script'):
script.extract()
html = str(soup)
html = re.sub('<.+?>', ' ', html)
html = " ".join(html.strip().split())
print(html)
Edited after it became clear to me what was really asked for...
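Alternatively, in the spirit of the separator tip quoted further down on this page, you could skip the regex and let get_text insert the spaces for you. This is only a sketch of that idea, not part of the original answer; the " " separator and strip=True are my additions:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/Mark_Zuckerberg"
req = Request(link, headers={'User-Agent': 'Chrome/5.0'})
soup = BeautifulSoup(urlopen(req).read(), "html.parser")

# drop script and style elements as before
for tag in soup(["script", "style"]):
    tag.extract()

# join the text fragments with a single space so adjacent elements don't run together
text = soup.get_text(" ", strip=True)
print(text)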

Related

How can I delete a big part of a string from a scraped page?

I made a web scraper to get the informative text of a Wikipedia page. I get the text I want, but I want to cut off a big part of the bottom text. I have already tried some other solutions, but with those I don't get the headers and whitespace I need.
import requests
from bs4 import BeautifulSoup
import re
website = "https://nl.wikipedia.org/wiki/Kat_(dier)"
request = requests.get(website)
soup = BeautifulSoup(request.text, "html.parser")
text = list()
text.extend(soup.findAll('mw-content-text'))
text_content = soup.text
text_content = re.sub(r'==.*?==+', '', text_content)
# text_content = text.replace('\n', '')
print(text_content)
Here, soup.text is all the text of the Wikipedia page with class='mw-content-text', printed as a string. This prints the overall text I need, but I need to cut off the string where it starts showing the text of the sources. I already tried the replace method, but it didn't do anything.
Given this page, I want to cut off what's under the red line in the big string of text I have scraped.
I tried something like this, which didn't work:
for content in soup('span', {'class': 'mw-content-text'}):
    print(content.text)
    text = content.findAll('p', 'a')
    for t in text:
        print(text.text)
I also tried this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import requests
website = urlopen("https://nl.wikipedia.org/wiki/Kat_(dier)").read()
soup = BeautifulSoup(website,'lxml')
text = ''
for content in soup.find_all('p'):
    text += content.text
text = re.sub(r'\[.*?\]+', '', text)
text = text.replace('\n', '')
# print(text)
but these approaches just gave me an unreadable mess of text. I still want the whitespace and headers that my base code gives me.
It is admittedly still a bit abstract, but you could reach your goal by iterating over all the children and breaking as soon as a tag with class appendix appears:
for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    if c.get('class') and 'appendix' in c.get('class'):
        break
    print(c.get_text(strip=True))
Example
import requests
from bs4 import BeautifulSoup
website = "https://nl.wikipedia.org/wiki/Kat_(dier)"
request = requests.get(website)
soup = BeautifulSoup(request.text)
for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    if c.get('class') and 'appendix' in c.get('class'):
        break
    print(c.get_text(strip=True))
There is likely a more efficient solution but here is a list comprehension that solves your issue:
# the rest of your code
references = [line for line in text_content.split('\n') if line.startswith("↑")]
Here's an alternative version that might be easier to understand:
# the rest of your code
# Turn text_content into a list of lines
text_content = text_content.split('\n')
references = []
# Iterate through each line and only save the values that start
# with the symbol used for each reference, on wikipedia: "↑"
# ( or "^" for english wikipedia pages )
for line in text_content:
    if line.startswith("↑"):
        references.append(line)
Both scripts will do the same thing.
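If the goal is to drop those reference lines from the output rather than collect them, a small follow-up along the same lines (my own addition, assuming text_content is still the original string from the question) would be:
# keep only the lines that do not start with the reference marker
cleaned = '\n'.join(
    line for line in text_content.split('\n')
    if not line.startswith("↑")
)
print(cleaned)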

how to add space before a tag in BeautifulSoup

I have the following piece of code:
html = urlopen(req).read()
soup = BeautifulSoup(html, "lxml")
# remove all script and style elements
for script in soup(["script", "style"]):
script.extract()
# get text
text = soup.get_text()
The problem is that if my HTML page has something like
Oxford<br />Laboratory
then after stripping the tags I get OxfordLaboratory.
So here is my question: how can I add a space before every < so that words do not get combined?
As the documentation states:
You can specify a string to be used to join the bits of text together:
# soup.get_text("|")
In your case you'll want a space (" ") as the separator.
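A minimal sketch of that suggestion applied to the question's code (strip=True is an optional extra I have added to trim surrounding whitespace):
# remove all script and style elements as before
for script in soup(["script", "style"]):
    script.extract()

# join the text fragments with a space so "Oxford" and "Laboratory" stay separate
text = soup.get_text(" ", strip=True)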

Webscraping from a script

I'm trying to extract the proportions of languages spoken at companies, using Python's BeautifulSoup.
Yet, the information seems to come from a script, not from HTML, and I'm having some trouble.
For instance, from the following page, when I try
webpage ="https://www.zippia.com/amazon-com-careers-487/"
page = requests.get(webpage)
soup = BeautifulSoup(page.content, 'lxml')
for links in soup.find_all('div', {'class':'companyEducationDegrees'}):
raw_text = links.get_text()
lines = raw_text.split('\n')
print(lines)
print('-------------------')
I don't get any result, while the ideal result should be Spanish 61.1%, French 9.7%, etc.
As you already found out, the data is put into the page via JavaScript. However, you can still get at it, because all of the company data is loaded with the page. You can access it via requests + BeautifulSoup + json (+ re):
import json
import re
import requests
from bs4 import BeautifulSoup
webpage = "https://www.zippia.com/amazon-com-careers-487/"
page = requests.get(webpage)
soup = BeautifulSoup(page.content, 'lxml')
for script in soup.find_all('script', {'type': 'text/javascript'}):
    if 'getCompanyInfo' in script.text:
        match = re.search("{[^\n]*}", script.text)
        data = json.loads(match.group())
        print(data["companyDiversity"]["languages"])
        json.dump(data, open("test.json", "w"), indent=2)  # only if you want the data written to a file in a readable format (e.g. to find the path to an entry)

extracting data using beautifulsoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
#specify the url
wiki = "http://www.bbc.com/urdu"
#Query the website and return the html to the variable 'page'
page = urlopen(wiki)
#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page,"html.parser")
all_links=soup.find_all("a")
for link in all_links:
    #print(link.get("href"))
    #text = soup.body.get_text()
    #print(text)
    for script in soup(["script", "style"]):
        script.extract()  # rip it out
    # get text
    text = soup.body.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    print(text)

text1 = str(text)
text_file = open("C:\\Output.txt", 'w')
text_file.write(text)
text_file.close()
I want to extract data from a news website using Beautiful Soup. I wrote some code, but it is not giving me the required output. First, I have to process all the links on a page, then extract data from each of them and save it to a file. Then move on to the next page, extract data, save it, and so on. Right now I was just trying to process the links on the first page, but it is not giving me the full text, and it also leaves some tags in the output.
To extract all links from a website you can try something like this:
data = []
soup = BeautifulSoup(page,"html.parser")
for link in soup.find_all('a', href=True):
    data.append(link['href'])
text = '\n'.join(data)
print(text)
And then proceed to save text into a file. After this you need to iterate over data and fetch each of those URLs as well.
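A rough sketch of that follow-up step, reusing the script/style-stripping idea from the question; the urljoin call, the 20-link cap, and the output file name are my own assumptions rather than part of the answer:
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

base = "http://www.bbc.com/urdu"
texts = []
for href in data[:20]:  # limit to the first few links while testing
    full_url = urljoin(base, href)  # handle relative links like /urdu/...
    try:
        sub_soup = BeautifulSoup(urlopen(full_url), "html.parser")
    except Exception:
        continue  # skip links that fail to load
    # strip script and style elements before extracting text
    for tag in sub_soup(["script", "style"]):
        tag.extract()
    texts.append(sub_soup.get_text(" ", strip=True))

with open("Output.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(texts))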

BeautifulSoup4 get_text still has JavaScript

I'm trying to remove all the HTML/JavaScript using bs4; however, it doesn't get rid of the JavaScript. I still see it there with the text. How can I get around this?
I tried using nltk, which works fine; however, clean_html and clean_url will be removed moving forward. Is there a way to use soup's get_text and get the same result?
I tried looking at these other pages:
BeautifulSoup get_text does not strip all tags and JavaScript
Currently I'm using nltk's deprecated functions.
EDIT
Here's an example:
import urllib
from bs4 import BeautifulSoup
url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print soup.get_text()
I still see the following for CNN:
$j(function() {
    "use strict";
    if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
        var pushLib = window.safaripushLib,
            current = pushLib.currentPermissions();
        if (current === "default") {
            pushLib.checkPermissions("helloClient", function() {});
        }
    }
});
/*globals MainLocalObj*/
$j(window).load(function () {
    'use strict';
    MainLocalObj.init();
});
How can I remove the js?
Only other options I found are:
https://github.com/aaronsw/html2text
The problem with html2text is that it's really, really slow at times and creates noticeable lag, which is one thing nltk was always very good about.
Based partly on Can I remove script tags with BeautifulSoup?
import urllib
from bs4 import BeautifulSoup
url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
script.decompose() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
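Side note on the design choice: decompose() removes the tag and destroys it, while extract() (used elsewhere on this page) removes the tag but returns it so you could reuse it; for simply discarding scripts and styles the two are interchangeable.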
To prevent encoding errors at the end...
import urllib
from bs4 import BeautifulSoup
url = url
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text.encode('utf-8'))
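In Python 3 the final encode call is usually unnecessary for printing; if the point is to write the text out without encoding errors, a hedged alternative (the file name is arbitrary) is to open the output file with an explicit encoding:
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)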
