HTML scraper output stuck in utf-8 - python

I'm working on a scraper for a number of chinese documents. As part of the project I'm trying to scrape the body of the document into a list and then write an html version of the document from that list (the final version will include metadata as well as the text, along with a folder full of individual html files for the documents).
I've managed to scrape the body of the document into a list and then use the contents of that list to create a new HTML document. I can even view the contents when I output the list to a csv (so far so good....).
Unfortunately, the HTML document that is output is all escape sequences like "\u6d88\u9664\u8d2b\u56f0\u3001".
Is there a way to encode the output so that this won't happen? Or do I just need to grow up and scrape the page for real (parsing and organizing it <p> by <p> instead of just copying all of the existing HTML as is) and then build the new HTML page element by element?
Any thoughts would be most appreciated.
from bs4 import BeautifulSoup
import urllib
# csv is for the csv writer
import csv

# initiates the list to hold the output
holder = []
# this is the target URL
target_url = "http://www.gov.cn/zhengce/content/2016-12/02/content_5142197.htm"
data = []
filename = "fullbody.html"
target = open(filename, 'w')

def bodyscraper(url):
    # opens the url for read access
    this_url = urllib.urlopen(url).read()
    # creates a new BS object based on the URL
    soup = BeautifulSoup(this_url, 'lxml')
    # finds the body text
    body = soup.find('td', {'class': 'b12c'})
    data.append(body)
    holder.append(data)
    print holder[0]
    for item in holder:
        target.write("%s\n" % item)

bodyscraper(target_url)

with open('bodyscraper.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(holder)

As the source page is UTF-8 encoded, just decode what urllib returns before handing it to BeautifulSoup, and it will work. I have tested this, and both the HTML and CSV output show the Chinese characters. Here is the amended code:
from bs4 import BeautifulSoup
import urllib
# csv is for the csv writer
import csv

# initiates the list to hold the output
holder = []
# this is the target URL
target_url = "http://www.gov.cn/zhengce/content/2016-12/02/content_5142197.htm"
data = []
filename = "fullbody.html"
target = open(filename, 'w')

def bodyscraper(url):
    # opens the url for read access
    this_url = urllib.urlopen(url).read()
    # creates a new BS object based on the URL
    soup = BeautifulSoup(this_url.decode("utf-8"), 'lxml')  # decode what urllib returns
    # finds the body text
    body = soup.find('td', {'class': 'b12c'})
    target.write("%s\n" % body)  # write the whole decoded body to html directly
    data.append(body)
    holder.append(data)

bodyscraper(target_url)

with open('bodyscraper.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(holder)
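For anyone running Python 3, here is a minimal sketch of the same fix (my addition, not part of the original answer): urllib.request replaces urllib, and the output files are opened with an explicit encoding so the Chinese text survives the write. The URL and the b12c selector are taken from the question.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import csv

target_url = "http://www.gov.cn/zhengce/content/2016-12/02/content_5142197.htm"

raw = urlopen(target_url).read()                   # bytes
soup = BeautifulSoup(raw.decode("utf-8"), "lxml")  # decode before parsing
body = soup.find("td", {"class": "b12c"})          # same selector as the question

# open the output files with an explicit encoding so the characters are preserved
with open("fullbody.html", "w", encoding="utf-8") as target:
    target.write("%s\n" % body)

with open("bodyscraper.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow([body.get_text(strip=True)])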

open .txt file and save output in csv file

I want to open a txt file (which contains multiple links) and scrape the title of each page using BeautifulSoup.
My txt file contains link like this:
https://www.lipsum.com/7845284869/
https://www.lipsum.com/56677788/
https://www.lipsum.com/01127111236/
My code:
import requests as rq
from bs4 import BeautifulSoup as bs

with open('output1.csv', 'w', newline='') as f:
    url = open('urls.txt', 'r', encoding='utf8')
    request = rq.get(str(url))
    soup = bs(request.text, 'html.parser')
    title = soup.findAll('title')
    pdtitle = {}
    for pdtitle in title:
        pdtitle.append(pdtitle.text)
    f.write(f'{pdtitle}')
I want to open all the links in the txt file and scrape the title from each. The main problem is that opening the txt file into the url variable is not working. How do I open the file and save the data to csv?
Your code isn't working because url holds the whole file object rather than the individual URLs. You need to read the URLs and request them one by one:
import requests as rq
from bs4 import BeautifulSoup as bs

with open(r'urls.txt', 'r') as f:
    urls = f.readlines()

with open('output1.csv', 'w', newline='') as f:
    for url in urls:
        request = rq.get(str(url))
        soup = bs(request.text, 'html.parser')
        titles = soup.findAll('title')
        for title in titles:
            f.write(f'{title.text}\n')
Your urls may not be working because they are being read with a trailing newline character, \n. You need to strip the text before putting them in a list.
Also, you are using .find_all('title'), and this will return a list, which is probably not what you are looking for. You probably just want the first title and that's it. In that case, .find('title') would be better. I have provided some possible corrections below.
from bs4 import BeautifulSoup
import requests

filepath = '...'

with open(filepath) as f:
    urls = [i.strip() for i in f.readlines()]

titles = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    title = soup.find('title')   # Note: will find the FIRST title only
    titles.append(title.text)    # Grabs the TEXT of the title only, removes HTML

new_csv = open('urls.csv', 'w')  # Make sure to prepend with desired location, e.g. 'C:/user/name/urls.csv'
for title in titles:
    new_csv.write(title + '\n')  # The '\n' ensures a new row is written
new_csv.close()
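A side note (my addition, not part of the answer above): since the output file is a .csv, the csv module will also quote titles that themselves contain commas. A minimal sketch, reusing the titles list built above:
import csv

with open('urls.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for title_text in titles:
        writer.writerow([title_text])  # one title per row, quoted automatically if needed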

Beautiful Soup and the find_all method not listing all tags in text file

I am trying to scrape a website that I saved into a local html file. When I use the find_all() method I can get all the tags' text displayed in the Python output. The problem is that I can't get it to write all the text to a .txt file.
from bs4 import BeautifulSoup

def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        for interest in interests:
            with open('interest.txt', 'w') as file:
                file.write(f'{interest.text}')
                print(interest.text)
Python will display all the tags as text, but when I write to the .txt file it only shows the last tag.
Edit: I would also like to do a similar thing, but with a docx file. I took Igor's suggested code and changed the parts I would need for a docx file, but I'm still having the same issue with the docx file.
from bs4 import BeautifulSoup
import docx

def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        with open('interest.txt', 'w') as file:
            for interest in interests:
                mydoc = docx.Document()
                mydoc.add_paragraph(f'{interest.text}')
                mydoc.save("C:/Users\satam\PycharmProjects\pythonProject\Web Scraper\list.docx")
                print(interest.text)
You reopen the file in write mode in every iteration; this overwrites its previous contents. Either open it just once and place the loop within the with block, or open it with the a mode (a for "append"; open('interest.txt', 'a')).
(The former is likely preferable in this case as it seems there's no reason to keep opening and closing the file again and again while you're continuously writing to it.)
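For illustration (my addition, not part of the original answer), a minimal sketch of the append-mode alternative, assuming interests already holds the <h2> tags found by the question's code:
# 'a' opens for appending, so earlier lines are kept instead of overwritten
for interest in interests:
    with open('interest.txt', 'a') as file:
        file.write(f'{interest.text}\n')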
Every iteration rewrites the interest.txt file.
You just need to take the with open... part out of the for loop.
Try replacing this fragment
for interest in interests:
    with open('interest.txt', 'w') as file:
        file.write(f'{interest.text}')
        print(interest.text)
with the following code:
with open('interest.txt', 'w') as file:
    for interest in interests:
        file.write(f'{interest.text}')
        print(interest.text)
Here is the complete code:
from bs4 import BeautifulSoup

def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        with open('interest.txt', 'w') as file:
            for interest in interests:
                file.write(f'{interest.text}')
                print(interest.text)
Edit: Here is the .docx version for the updated question:
from bs4 import BeautifulSoup
import docx

def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        mydoc = docx.Document()
        for interest in interests:
            mydoc.add_paragraph(f'{interest.text}')
            print(interest.text)
        mydoc.save("C:/Users\satam\PycharmProjects\pythonProject\Web Scraper\list.docx")
N.B.: the docx module can be installed with pip install python-docx.

BeautifulSoup, how can I stop my list of web links from overwriting each other?

The code below only gives me the last word in the list.
import csv
wo = csv.reader(open('WORD.csv'))
row = list(wo)

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client

# URL to web scrape from.
# in this example we web scrape lexico
with open("WORD.csv") as f:
    for row in csv.reader(f):
        for word in row:
            # Number of pages plus one
            url = "https://www.lexico.pt/{}".format(word)

# opens the connection and downloads html page from url
uClient = uReq(url)
page_html = uClient.read()

# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(page_html, "html.parser")

# finds each product from the store page
containers = page_soup.find("div", {"class": "card card-pl card-pl-significado"})

# name the output file to write to local disk
out_filename = "test.csv"
# opens file, and writes headers
f = open(out_filename, "w")

Word = containers.h2.text
Defention = containers.p.text

f.write("\n" + Word + ", " + Defention + "\n")
f.close()
Please help, I have tried everything. I am a beginner with BeautifulSoup, so sorry for my terrible code format.
As I mentioned earlier, I believe that you have already achieved your goal.
In Python, scope is determined by indentation, which defines where a local variable is valid. Since you do not apply it consistently in your example, the iteration is already complete by the time your first request is sent; the loop variable has been reassigned and contains only the word from the last iteration step.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

# open files for reading and writing
with open('WORD.csv') as src, open('test.txt', 'w') as dst:
    # read row by row
    for row in csv.reader(src):
        # get words separated by comma
        for word in row:
            # open connection and create parser with read data
            url = f'https://www.lexico.pt/{word}'
            resp = urlopen(url)
            html = soup(resp.read(), 'html.parser')
            # find card/content
            card = html.find('div', {'class': 'card-pl-significado'})
            word = card.h2.text
            desc = card.p.text
            # write formatted result to file
            dst.write(f'{word}, {desc}\n')
Have fun
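As an optional tweak (my assumption, based on the original out_filename of test.csv): if the output should be a real CSV rather than a plain text file, swapping in csv.writer is a small change to the sketch above.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

with open('WORD.csv') as src, open('test.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        for word in row:
            html = soup(urlopen(f'https://www.lexico.pt/{word}').read(), 'html.parser')
            card = html.find('div', {'class': 'card-pl-significado'})
            if card:  # skip words the site does not know
                writer.writerow([card.h2.text, card.p.text])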

Python Extract Table from URL to csv

I am extracting the "2016-Annual" table at http://www.americashealthrankings.org/api/v1/downloads/131 to a csv. The table has 3 fields: STATE, RANK, VALUE. I am getting an error with the following:
import urllib2
from bs4 import BeautifulSoup
import csv

url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
header = {'User-Agent': 'Mozilla/5.0'}

req = urllib2.Request(url, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

table = soup.find('2016-Annual', {'class': 'STATE-RANK-VALUE'})

f = open('output.csv', 'w')

for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 3:
        STATE = cells[0].find(text=True)
        RANK = cells[1].find(text=True)
        VALUE = cells[2].find(text=True)
        print write_to_file
        f.write(write_to_file)

f.close()
What am I missing here? I am using Python 2.7.
Your code is wrong: 'http://www.americashealthrankings.org/api/v1/downloads/131' already downloads a CSV file. Download the CSV file to your local computer and you can use it directly:
#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''
import urllib2

url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
html = urllib2.urlopen(url).read()
with open('output.csv', 'w') as output:
    output.write(html)
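Once output.csv is saved, a minimal sketch for pulling out the three fields might look like the following; the column names are an assumption taken from the question (STATE, RANK, VALUE), so check the file's header row first:
import csv

with open('output.csv') as f:
    for row in csv.DictReader(f):
        # assumed column names; adjust to whatever the header row actually contains
        print row['STATE'], row['RANK'], row['VALUE']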
According to the BeautifulSoup docs, you need to pass the markup to be parsed on initialization. However, page = urllib2.urlopen(req) returns a response object, not the page contents.
Try using soup = BeautifulSoup(page.read(), 'html.parser') instead.
Also, the variable write_to_file doesn't exist.
If this doesn't solve it, please also post which error you get.
The reason it's not working is that you're pointing at a file that is already a CSV: you can literally load that URL in your browser and it will download in CSV format. The table you're expecting, though, is not at that endpoint; it is at this URL:
http://www.americashealthrankings.org/explore/2016-annual-report
Also, I don't see a class called STATE-RANK-VALUE; I only see th headers called state, rank, and value.

How to go through a list of urls to retrieve page data - Python

In a .py file, I have a variable that's storing a list of urls. How do I properly build a loop to retrieve the code from each url, so that I can extract specific data items from each page?
This is what I've tried so far:
import requests
import re
from bs4 import BeautifulSoup
import csv

# Read csv
csvfile = open("gymsfinal.csv")
csvfilelist = csvfile.read()
print csvfilelist

# Get data from each url
def get_page_data():
    for page_data in csvfilelist.splitlines():
        r = requests.get(page_data.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup

pages = get_page_data()
print pages
By not using the csv module, you are reading the gymsfinal.csv file as a plain text file. Read through the documentation on reading/writing csv files here: CSV File Reading and Writing.
Also, your current code will give you only the first page's soup content, because the get_page_data() function returns after creating the first soup. Instead, you can yield from the function, like this:
def get_page_data():
    for page_data in csvfilelist.splitlines():
        r = requests.get(page_data.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup

pages = get_page_data()

# iterate over the generator
for page in pages:
    print page
Also close the file you just opened.
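Putting both suggestions together, a rough sketch might look like this (it assumes gymsfinal.csv holds one URL per row in the first column, which the question does not confirm):
import requests
from bs4 import BeautifulSoup
import csv

def get_page_data(filename):
    with open(filename) as csvfile:           # the file is closed automatically
        for row in csv.reader(csvfile):
            r = requests.get(row[0].strip())  # assumes the URL sits in the first column
            yield BeautifulSoup(r.text, 'html.parser')

for page in get_page_data("gymsfinal.csv"):
    print page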
