I am writing a program to scrape a Wikipedia table with Python. Everything works fine except for some of the characters, which don't seem to be encoded properly by Python.
Here is the code:
import csv
import requests
from BeautifulSoup import BeautifulSoup
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

url = 'https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_A'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'wikitable sortable'})

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

outfile = open("./scrapedata.csv", "wb")
writer = csv.writer(outfile)
print list_of_rows
writer.writerows(list_of_rows)
For example, Merzbrück is coming out as MerzbrÃ¼ck in the output. The issue mostly seems to be with accented characters (é, è, ç, à, etc.). Is there a way I can avoid this?
Thanks in advance for your help.
This is, of course, an encoding issue; the question is where it creeps in. My suggestion is to work through each step and look at the raw data to find exactly where the encoding goes wrong.
So, for example, print response.content to see if the symbols are as you expect in the requests object. If so, move on and check soup.prettify() to see if the BeautifulSoup object looks OK, then list_of_rows, and so on.
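A minimal sketch of that step-by-step inspection, reusing the response and soup objects from the question:

import sys
print(sys.stdout.encoding)       # what your console expects
print(response.encoding)         # the encoding requests detected
print(response.content[:300])    # the raw bytes from the server
print(soup.prettify()[:300])     # how BeautifulSoup decoded them

If the umlauts are intact in response.content but broken afterwards, the decoding step is the culprit; if they survive all the way into list_of_rows, the problem is in the csv write.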
All that said, my suspicion is that the issue has to do with writing to csv. The csv documentation has an example of how to write Unicode to csv, and this answer might also help you with the problem.
For what it's worth, I was able to write the proper symbols to csv using the pandas library (I'm using python 3 so your experience or syntax may be a little different since it looks like you are using python 2):
import pandas as pd
df = pd.DataFrame(list_of_rows)
df.to_csv('scrapedata.csv', encoding='utf-8')
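If you want to stay with the plain csv module on Python 2, a hedged sketch of the usual workaround is to encode each cell to UTF-8 bytes before writing (this assumes list_of_rows holds unicode strings, as in your code):

import csv

with open("scrapedata.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    for row in list_of_rows:
        # csv on Python 2 handles byte strings; encode each unicode cell
        writer.writerow([cell.encode("utf-8") for cell in row])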
I made a small web scraping bot using Python 3. It scrapes text between classes and writes it to a .csv file, but when I open the file, the Arabic parts look like this:
وائل ÙتÙ
I tried arabic_reshaper, but it seems to only handle text direction and some reshaping; the stored string shows the same bad characters as above.
Also, this code below successfully writes Arabic content to a text file:
s = "ذهب الطالب الى المدرسة"
with open("file.txt", "w", encoding="utf-8") as myfile:
    myfile.write(s)
Note: I'm using the Selenium driver to get the content:
content = driver.page_source
soup = BeautifulSoup(content)
Try this, it should work:
soup = BeautifulSoup(content.decode('utf-8'))
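One caveat (my observation, not from the original thread): with Selenium on Python 3, driver.page_source is usually already a str, so calling .decode() on it raises AttributeError. In that case, passing the string straight in should be enough:

soup = BeautifulSoup(content, "html.parser")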
Answer after digging further into the problem:
1. I found that if I open the file with plain Windows Notepad, I can see the Arabic content, so Python was producing the website content correctly!
2. I used this video as a reference to correctly display the data in Excel (which is where the problem actually was):
https://www.youtube.com/watch?v=V6AR_Hi7p5Q
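In short, Excel needs a byte-order mark to recognize a CSV as UTF-8. A hedged sketch of that fix, reusing the sample sentence from above (the row content is just an example):

import csv

rows = [["ذهب الطالب الى المدرسة"]]
# "utf-8-sig" prepends a BOM so Excel detects the encoding correctly
with open("output.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)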
I downloaded web pages using wget, and now I am trying to extract some data I need from those pages. The problem is with the Japanese words contained in this data; the English words were extracted perfectly.
When I try to extract the Japanese words and use them in another app, they appear as gibberish. While testing different methods, one solution fixed only half of the Japanese words.
What I tried: I tried
from_encoding="utf-8"
which had no effect. I also tried multiple ways to extract the text from the HTML, like
section.get_text(strip=True)
section.text.strip()
and others. I also tried URL-encoding the generated text, which did not work, and I tried every snippet I could find on Stack Overflow.
One method that strangely worked (but not completely) was saving the string in a dictionary, saving that to JSON, and then loading the JSON from another script. Using the dictionary directly would not work; I have to use JSON as a middleman between the two scripts. Strange. (Not all the words worked.)
My question may look like a duplicate of another question, but that other question is about scraping from the internet, while I am trying to extract from an offline source.
Here is a simple script demonstrating the main problem:
from bs4 import BeautifulSoup

page = BeautifulSoup(open("page1.html"), 'html.parser', from_encoding="utf-8")
word = page.find('span', {'class': "radical-icon"})
wordtxt = word.get_text(strip=True)

# then save the word to a file
with open("text.txt", "w", encoding="utf8") as text_file:
    text_file.write(wordtxt)
When I open the file, I get gibberish characters.
Here is the part of the HTML that BeautifulSoup searches:
<span class="radical-icon" lang="ja">亠</span>
The expected result is to get that symbol into the text file, or to save it properly in any other way.
Is there a better web scraper to use to properly get the UTF-8?
PS: Sorry for the bad English.
I think I found an answer: just uninstall beautifulsoup4, I don't need it.
Python has a built-in way to search strings. I tried something like this:
import codecs
import re

with codecs.open("page1.html", 'r', 'utf-8') as myfile:
    for line in myfile:
        if line.find('<span class="radical-icon"') > -1:
            result = re.search('<span class="radical-icon" lang="ja">(.*)</span>', line)
            s = result.group(1)
            with codecs.open("text.txt", 'w', 'utf-8') as textfile:
                textfile.write(s)
which is an overcomplicated, non-Pythonic way of doing it, but what works works.
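For what it's worth, the original BeautifulSoup approach usually works too if the file is opened with an explicit encoding (my guess at the root cause: open() falling back to a non-UTF-8 locale encoding, which is common on Windows):

from bs4 import BeautifulSoup

with open("page1.html", encoding="utf-8") as f:
    page = BeautifulSoup(f, "html.parser")

word = page.find("span", {"class": "radical-icon"})
with open("text.txt", "w", encoding="utf-8") as text_file:
    text_file.write(word.get_text(strip=True))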
I wrote a Python script to parse info about shows at a local venue, to display an updated agenda on my home screen.
Everything prints out OK, but I get a "string index out of range" error on what I believe to be the last result.
The code is below. I appreciate any help to shed some light on this specific error.
import urllib.request
import bs4 as bs

# READ THE DESIRED URL
sauce = urllib.request.urlopen('http://www.coliseu.pt/agenda/').read()

# PARSE THE HTML
soup = bs.BeautifulSoup(sauce, 'lxml')

# RESULTS WILL HOLD THE INFORMATION PARSED
results = []

# FIND ALL DIVS WITH CLASS BL-BLOCK, GATHER ALL THE TEXT
# AND CHECK FOR THE APPROPRIATE RETURNS TO AVOID DUPLICATE FIELDS
for div in soup.find_all('div', class_='bl-block'):
    row = [i.text for i in div]
    # FOR DEBUGGING
    print(row)
    if len(row) == 3 and row[2][0].isdigit():
        results.append(row)

# SAVE RESULTS TO FILE IF ANY ARE FOUND
if len(results) > 0:
    file = open('testfile.txt', 'w')
    for line in results:
        file.write(line + '\n')
    file.close()
I haven't gotten to the .txt file write method because I can't get past this error.
After @Vahid solved the above problem, I now get encoding errors when trying to save the results to a file.
Since I'm working with a list, I can't write it directly to the file, and any attempt at converting the list to a string keeps bouncing the same error.
Can't figure a way to get the data into a file.
//UPDATED
Found out where the problem lies: I'm not able to encode the \u0327 character (a combining cedilla) to UTF-8, and I can't understand why. The character is "ç".
I tried encoding it to UTF-8 and to ASCII without success. I'd appreciate any help.
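A hedged sketch of a write step that addresses both issues described above: each row is a list, so it is joined into a string first, and the file is opened with an explicit UTF-8 encoding so "ç" (including the decomposed form with U+0327) survives the write:

if results:
    with open("testfile.txt", "w", encoding="utf-8") as f:
        for line in results:
            # each row is a list of strings; join it before writing
            f.write(" ".join(line) + "\n")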
I'm fairly new to Python and have a question as to why the following code doesn't produce any output in the csv file. The code is as follows:
import csv
import urllib2
url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)
for row in cr:
    with open("AusCentralbank.csv", "wb") as f:
        writer = csv.writer(f)
        writer.writerows(row)
Cheers.
Edit:
Brien and Albert solved the initial issue I had. However, I now have one further question. The CSV file listed above, found at http://www.rba.gov.au/statistics/tables/#interest-rates under Zero-coupon "Interest Rates - Analytical Series - 2009 to Current - F17" (the F17 Yields CSV), contains five workbooks, and I actually just want the data in the fifth one. Is there a way I could do this? Cheers.
I could only test my code using Python 3. However, the only difference should be urllib2; hence I am using urllib.request for opening the desired URL.
The variable html is of type bytes and can generally be written to a file in binary mode. Additionally, your source is a csv file already, so there should be no need to convert it somehow:
#!/usr/bin/env python3
# coding: utf-8
import urllib.request

url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib.request.urlopen(url)
html = response.read()

with open('output.csv', 'wb') as f:
    f.write(html)
It is probably because of your opening mode.
According to the documentation:
'w' for only writing (an existing file with the same name will be erased)
You should use append ('a') mode to append to the end of the file instead:
'a' opens the file for appending; any data written to the file is automatically added to the end.
Also, since the file you are trying to download is already a csv file, you don't need to convert it.
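Alternatively, instead of appending, a hedged sketch of a minimal fix to the original program (Python 2) is to open the file once, outside the loop, and hand the whole reader to writerows, since writerows expects a list of rows rather than a single row:

import csv
import urllib2

url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)

# open once so the file is not truncated on every iteration
with open("AusCentralbank.csv", "wb") as f:
    csv.writer(f).writerows(cr)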
@albert had a great answer. I've gone ahead and converted it to the equivalent Python 2.x code. You were doing a bit too much work in your original program; since the file was already a csv, you didn't need to do any special work to turn it into one.
import urllib2
url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib2.urlopen(url)
html = response.read()
with open('AusCentralbank.csv', 'wb') as f:
f.write(html)
I am trying to read an entire web page and assign it to a variable, but I'm having trouble doing that. The variable seems to only be able to hold the first 512 or so lines of the page source.
I tried using readlines() to just print all lines of the source to the screen, and that gave me the source in its entirety, but I need to be able to parse it with regex, so I need to store it in a variable somehow. Help?
data = urllib2.urlopen(url)
print data
Only gives me about 1/3 of the source.
data = urllib2.urlopen(url)
for lines in data.readlines():
    print lines
This gives me the entire source.
Like I said, I need to be able to parse the string with regex, but the part I need isn't in the first third that I'm able to store in my variable.
You are probably looking for Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/), an open-source HTML parsing library for Python. Best of luck!
You should be able to use file.read() to read the entire file into a string. That will give you the entire source. Something like
data = urllib2.urlopen(url)
print data.read()
should give you the entire webpage.
From there, don't parse HTML with regex (well-worn post to this effect here), but use a dedicated HTML parser instead. Alternatively, clean up the HTML and convert it to XHTML (for instance with HTML Tidy), and then use an XML parsing library like the standard ElementTree. Which approach is best depends on your application.
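For instance, a minimal sketch of the parser-based route, using BeautifulSoup here as one option (the tag and attribute are only examples):

import urllib2
from bs4 import BeautifulSoup

data = urllib2.urlopen(url).read()        # url as in the question
soup = BeautifulSoup(data, "html.parser")
links = [a.get("href") for a in soup.find_all("a")]  # e.g. every link on the page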
Actually, print data should not give you any HTML content, because data is just a file-like object. From the official documentation (https://docs.python.org/2/library/urllib2.html):
This function returns a file-like object
This is what I got :
print data
<addinfourl at 140131449328200 whose fp = <socket._fileobject object at 0x7f72e547fc50>>
readlines() returns a list of the lines of the HTML source, and you can collect it into a string like this:
import urllib2

data = urllib2.urlopen(url)
l = []
for line in data.readlines():
    l.append(line)
s = ''.join(l)  # readlines() keeps each line's trailing '\n', so join with ''

You can use either the list l or the string s, according to your need.
I would also recommend using an open-source HTML parsing library rather than regex for full HTML parsing; you may still need regex for URL parsing, though.
If you want to parse the variable afterwards, you might use gazpacho:
from gazpacho import Soup
url = "https://www.example.com"
soup = Soup.get(url)
str(soup)
That way you can perform finds to extract the information you're after!
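For example, a hypothetical find for the first heading on the page (the tag name is just for illustration):

from gazpacho import Soup

url = "https://www.example.com"
soup = Soup.get(url)
heading = soup.find("h1", mode="first")  # return only the first match
print(heading.text)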