from urllib.request import urlopen
from bs4 import BeautifulSoup

# specify the url
wiki = "http://www.bbc.com/urdu"

# query the website and return the html to the variable 'page'
page = urlopen(wiki)

# parse the html in the 'page' variable and store it in Beautiful Soup format
soup = BeautifulSoup(page, "html.parser")

# print the href of every link on the page
all_links = soup.find_all("a")
for link in all_links:
    print(link.get("href"))

# kill all script and style elements so they don't show up in the text
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.body.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines (separated by double spaces) into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)

# utf-8 so the Urdu text survives the round trip to disk
text_file = open("C:\\Output.txt", 'w', encoding='utf-8')
text_file.write(text)
text_file.close()
I want to extract data from a news website using Beautiful Soup. I wrote some code, but it is not giving me the required output. First, I have to process all the links on a page, extract data from each, and save it to a file. Then move on to the next page, extract data, save it, and so on. Right now I am just trying to process the links on the first page, but it is not giving me the full text, and it is also giving me some tags in the output.
To extract all links from a website, you can try something like this:
data = []
soup = BeautifulSoup(page, "html.parser")
for link in soup.find_all('a', href=True):
    data.append(link['href'])
text = '\n'.join(data)
print(text)
And then proceed to save text into a file. After this you need to iterate over data and fetch each of those URLs as well.
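A minimal sketch of that second step, reusing wiki and data from above (urljoin resolves the relative hrefs that sites like this use; error handling for unfetchable links is omitted):

from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

for href in data:
    # resolve relative links against the starting page
    full_url = urljoin(wiki, href)
    sub_soup = BeautifulSoup(urlopen(full_url), "html.parser")
    # drop script/style elements before grabbing the text, as above
    for tag in sub_soup(["script", "style"]):
        tag.extract()
    print(sub_soup.get_text())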
Related
I have a text file which I read in; I then extract the data I require and try sending it to a different, new text file, but only the first line makes it into the new file.
import requests

url_file = open('url-test.txt', 'r')
out_file = open('url.NDJSON', 'w')

for url in url_file.readlines():
    html = requests.get(url).text

out_file.writelines(html)
out_file.close()
Try:

for url in url_file.readlines():
    html = requests.get(url).text
    out_file.write(html)
or
lines = []
for url in url_file.readlines():
    html = requests.get(url).text
    # verify you are getting the expected data
    print(111111, html)
    lines.append(html)

out_file.writelines(lines)
That is, either write each html string inside the for loop, or append the strings to a list and call writelines once after the loop.
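For completeness, a sketch of the first option using context managers; note that reading lines from a file leaves a trailing newline on each url, which is worth stripping before passing it to requests:

import requests

with open('url-test.txt') as url_file, open('url.NDJSON', 'w') as out_file:
    for url in url_file:
        # strip the trailing newline so requests gets a clean url
        html = requests.get(url.strip()).text
        out_file.write(html)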
I'm a beginner in Python. I have a CSV file with links to webpages. First I'd like to read this file and find all further links for each webpage. Then I'd like to read the content from those links and save all the content from a webpage to a txt file. I need one txt file per webpage. My problem is that I'm not getting any results.
I use Python 3 and work with Anaconda.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

# 1. Read csv-file with list of webpages
with open(r'F:\Promotion\CSV\Liste_Blogs_Test.csv', 'r') as csvf:
    urls = list(csv.reader(csvf))  # materialise before the file closes

# 2. Read links
for url in urls:  # parse through each url in the list
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, 'html.parser')

    # 3. Collect all other links on the website (collect them all, don't overwrite)
    webpage_links = [link['href'] for link in soup.findAll('a', href=True)]

    # 4. Read content of links and save it to one txt file per website
    file_name = '{}.txt'.format(url[0].replace('/', '_'))  # slashes are not legal in file names
    for x in webpage_links:
        try:
            with open(file_name, 'a') as f:
                link_content = urlopen(x).read()  # x is already a full url string
                soup_1 = BeautifulSoup(link_content, "html.parser")
                for a in soup_1.find_all('p'):
                    content = a.get_text()
                    f.write(content)
        except ValueError:
            pass  # relative links raise ValueError in urlopen and are skipped
I am using the code below to scrape a page and kill the script and style elements so that I only get the text from the webpage:
link= "https://en.wikipedia.org/wiki/Mark_Zuckerberg"
url = Request(link,headers={'User-Agent': 'Chrome/5.0'})
html = urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Example: suppose the soup from the website is
<ul><li>Technology entrepreneur</li><li>philanthropist</li></ul></div></td>
</tr><tr><th scope="row">Years active</th><td>
I want it to print
Technology entrepreneur philanthropist Years active
whereas it is printing
Technology entrepreneurphilanthropistYears active
I want it to insert a space wherever it strips out the script and style elements. Any suggestions on the above code are appreciated. You can run the original URL to check.
After you extract the script tags, you can convert the HTML to a string and use a regex to substitute the remaining tags with spaces.
This works for me:
import requests
from bs4 import BeautifulSoup
import re

link = "https://en.wikipedia.org/wiki/Mark_Zuckerberg"
r = requests.get(link, headers={'User-Agent': 'Chrome/5.0'})
html = r.text
soup = BeautifulSoup(html, "lxml")  # feel free to use other parsers, e.g. html.parser; I use lxml as it's the fastest one...

for script in soup.find_all('script'):
    script.extract()

html = str(soup)
html = re.sub('<.+?>', ' ', html)  # replace every remaining tag with a space
html = " ".join(html.strip().split())  # collapse runs of whitespace
print(html)
Edited after it became clear to me what was really asked for...
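An alternative worth knowing about, assuming your bs4 version is recent enough: get_text accepts a separator argument, which sidesteps the regex entirely:

from bs4 import BeautifulSoup

# the fragment from the question
html = '<ul><li>Technology entrepreneur</li><li>philanthropist</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# a separator is inserted between every text fragment; strip=True trims each one
print(soup.get_text(separator=" ", strip=True))
# prints: Technology entrepreneur philanthropist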
I'm trying to remove all the HTML/JavaScript using bs4; however, it doesn't get rid of the JavaScript. I still see it there with the text. How can I get around this?
I tried using nltk, which works fine; however, clean_html and clean_url will be removed going forward. Is there a way to use soup's get_text and get the same result?
I tried looking at these other pages:
BeautifulSoup get_text does not strip all tags and JavaScript
Currently I'm using nltk's deprecated functions.
EDIT
Here's an example:
import urllib
from bs4 import BeautifulSoup
url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print soup.get_text()
I still see the following for CNN:
$j(function() {
    "use strict";
    if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
        var pushLib = window.safaripushLib,
            current = pushLib.currentPermissions();
        if (current === "default") {
            pushLib.checkPermissions("helloClient", function() {});
        }
    }
});

/*globals MainLocalObj*/
$j(window).load(function () {
    'use strict';
    MainLocalObj.init();
});
How can I remove the js?
The only other option I found is:
https://github.com/aaronsw/html2text
The problem with html2text is that it's really, really slow at times and creates noticeable lag, which is one thing nltk was always very good with.
Based partly on Can I remove script tags with BeautifulSoup?
import urllib
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()  # rip it out

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines (separated by double spaces) into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
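(For reference: decompose() removes the tags and destroys them outright, while extract(), used in the variant below, removes each tag but returns it. For stripping scripts out of the soup, either one works.)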
To prevent encoding errors at the end...
import urllib
from bs4 import BeautifulSoup

url = "http://www.cnn.com"  # your url here
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines (separated by double spaces) into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text.encode('utf-8'))
My overall goal is to isolate tags that contain a certain word in their text and have only those print to a text file.
So far, I have been able to extract a particular tag, in this case the title tags, and get those to print to a text file.
My question is: once I've got all the text in the extracted tags, what can I do with it? I am having trouble figuring out a way to isolate a particular word and further trim the text down to only what I need.
Here is what I have so far:
import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://www.website.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

page = soup.findAll('title')
for element in page:
    print element

file_name = raw_input("What do you want to name the file?> ")
text_file = open("/Users/user1/Projects/%s.txt" % file_name, "w")
text_file.write("%s" % page)
text_file.close()
What gets returned to me is:
<title>food</title>
<title>ball</title>
<title>car</title>
<title>desk</title>
<title>blue food</title>
<title>green food</title>
<title>red ball</title>
How would I get it to only print results that include 'food'?
You can get the contents of an element using .string. If you only wish to include results containing 'food', add a check for that:
for element in page:
    if 'food' in element.string:
        print element.string
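One caveat worth noting: .string is None whenever a tag holds anything other than a single string, so on messier pages a guard avoids a TypeError:

for element in page:
    # skip tags whose .string is None (e.g. tags with child elements)
    if element.string and 'food' in element.string:
        print element.string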
For example, if you want to extract the data from the page and put it in a CSV file, you can do it like this:
import urllib2
from BeautifulSoup import BeautifulSoup
import csv

file_name = raw_input("What do you want to name the file?> ")
c = csv.writer(open("%s.csv" % file_name, "a"), delimiter=";")  # open the CSV file and write to it

url = 'http://www.website.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

page = soup.findAll('title')
for element in page:
    element = element.text.encode('utf-8')
    c.writerow([element])
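A small hardening suggestion, assuming Python 2.6+: creating the writer inside a with block guarantees the CSV file is flushed and closed (csv on Python 2 wants the file opened in binary mode):

import csv

with open("%s.csv" % file_name, "ab") as f:
    c = csv.writer(f, delimiter=";")
    for element in page:
        c.writerow([element.text.encode('utf-8')])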
You can use your CSV file in Excel and/or a text editor; it can be useful. My code is far from perfect, but it should work anyway :)