Really new to Python and getting data from the web, so here it goes.
I have been able to pull data from the NYT api and parse the JSON output into a CSV file. However, depending on my search, I may get the following error when I attempt to write a row to the CSV.
UnicodeEncodeError: 'charmap' codec can't encode characters in position 20-21: character maps to <undefined>
This URL has the data that I am trying to parse into a CSV. (I de-selected "Print pretty results")
I am pretty sure the error is occurring near title:"Spitzer......."
I have tried to search the web, but I can't seem to get an answer. I don't know a lot about encoding, but I am guessing the data I retrieve from the JSON records is encoded in some way.
Any help you can provide will be greatly appreciated.
Many thanks in advance,
Brock
You need to check your HTTP headers to see what char encoding they are using when returning the results. My bet is that everything is encoded as utf-8 and when you try to write to CSV, you are implicitly encoding output as ascii.
The ’ they are using is not in the ASCII char set. You can catch the UnicodeError exception.
Follow the golden rules of encodings.
Decode early into unicode (data.decode('utf-8', 'ignore'))
Use unicode internally.
Encode late - during output - data.encode('ascii', 'ignore')
You can probably set your CSV writer to use utf-8 encodings when writing.
Note: You should really see what encoding they are giving you before blindly using utf-8 for everything.
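For illustration, here is a minimal sketch of those rules in Python 3, where you pass the encoding to open() and let the csv module write through it. The URL and JSON field names here are made up for illustration, not the real NYT API shapes:

import csv
import json
import urllib.request

# Decode early: turn the raw bytes into text as soon as they arrive.
# (The URL is a placeholder, not the real NYT endpoint.)
with urllib.request.urlopen('https://api.example.com/articles?q=spitzer') as resp:
    decoded = resp.read().decode('utf-8')

# Use unicode internally: json.loads gives you plain text strings.
records = json.loads(decoded)

# Encode late: the file object does the utf-8 encoding at write time.
with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for rec in records.get('results', []):  # 'results' is an assumed field name
        writer.writerow([rec.get('title', ''), rec.get('url', '')])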
Every piece of textual data is encoded. It's hard to tell what the problem is without any code, so the only advice I can give now is: Try decoding the response before parsing it ...
resp = do_request()
## look on the nyt site if they mention the encoding used and use it instead.
decoded = resp.decode('utf-8')
parsed = parse( decoded )
It appears to be tripping on '\/', the escape sequence the JSON output uses wherever a slash appears. This can be avoided by using the str function.
str('http:\/\/www.nytimes.com\/2010\/02\/17\/business\/global\/17barclays.html')
'http:\\/\\/www.nytimes.com\\/2010\\/02\\/17\\/business\\/global\\/17barclays.html'
From there you can use replace:
str('http:\/\/www.nytimes.com\/2010\/02\/17\/business\/global\/17barclays.html').replace('\\', "")
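As an aside, if you run the value through a JSON parser first, the \/ escapes are resolved for you, so the replace step may be unnecessary. A quick check:

import json

# json.loads un-escapes \/ to a plain slash automatically:
url = json.loads('"http:\\/\\/www.nytimes.com\\/2010\\/02\\/17\\/business\\/global\\/17barclays.html"')
print(url)  # http://www.nytimes.com/2010/02/17/business/global/17barclays.html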
Be careful with the NYT API -- it does not provide the full body text.
I have converted a kdb query into a dataframe and then uploaded that dataframe to a csv file. This caused an encoding error which I easily fixed by decoding to utf-8. However, there is one column which this did not work for.
"nameFid" is the column which isn't working correctly, it outputs on the CSV file as " b'STRING' "
I am running Python 3.7, any other information needed I will be happy to provide.
Here is my code which decodes the data in the dataframe I get from kdb
def decode_object_columns(df):
    # Decode every object-dtype column from bytes to str.
    for ba in df.dtypes.keys():
        if df.dtypes[ba] == 'O':
            try:
                df[ba] = df[ba].apply(lambda x: x.decode('UTF-8'))
            except Exception as e:
                print(e)
    return df
This worked for every column except "nameFid"
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 6: invalid continuation byte
This is one error I get, but I thought it suggests the data isn't encoded as UTF-8 -- in which case surely none of the columns would work?
When using the try/except, it instead prints "'Series' object has no attribute 'decode'".
My goal is to remove the "b''" from the column values, which currently show
" b'STRING' "
I'm not sure what else I need to add, so let me know if you need anything.
Also, sorry, I am quite new to all of this.
Many encodings are partially compatible with one another. This is mostly due to the prevalence of ASCII, so a ton of them are backward compatible with ASCII but extend it differently. Hence, if your other columns only contain things like numbers, they are likely ASCII-only and will work with a lot of different encodings.
The column that raises an error, however, contains characters outside the normal ASCII range, and so the encoding starts to matter. If you don't know the encoding of the file, you can use chardet to try to guess it. Keep in mind that this is just guessing: decoding with the wrong encoding may not raise any error, yet it can produce the wrong characters in the final text, so you should always find out which encoding to use.
This said, if you are on Linux, the standard file utility can often give you a rough guess of the encoding; for more advanced use cases something like chardet is necessary.
Once you have found the correct encoding -- say you found it is latin-1 -- simply replace the decode('utf-8') with decode('latin-1').
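For example, a sketch of the guessing step with chardet; the column name comes from the question above, and the sample size is an arbitrary choice:

import chardet

# Join some raw values from the failing column so the detector has
# enough context to make a decent guess.
sample = b''.join(df['nameFid'].head(100))
guess = chardet.detect(sample)
print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.7, ...}

if guess['encoding']:
    df['nameFid'] = df['nameFid'].str.decode(guess['encoding'])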
I received a CSV file exported from a MySQL database (I think the encoding is latin1, since the language is Spanish). Unfortunately the encoding is wrong and I cannot process it at all. If I use file:
$ file -I file.csv
file.csv: text/plain; charset=unknown-8bit
I have tried to read the file in python and convert it to utf-8 like:
r.decode('latin-1').encode("utf-8")
or using mysql_latin1_codec:
r.decode('mysql_latin1').encode('UTF-8')
I am trying to transform the data into json objects. The error comes when I save the file:
UnicodeEncodeError: 'ascii' codec can't encode characters in position
Do you know how can I convert it to normal utf-8 chars? Or how can I convert data to a valid json? Thanks!!
I got really good results by using the pandas DataFrame from Continuum Analytics.
You could do something like:
import pandas as pd

# con must be a real connection, e.g. a SQLAlchemy engine built from your
# user, password, host, and database.
con = 'Your database connection credentials: user, password, host, database'
data = pd.read_sql_query('SELECT * FROM YOUR_TABLE', con=con)
Then you could do:
data.to_csv('path_with_file_name')
or to convert to JSON:
data.to_json(orient='records')
or if you prefer to customize your json format see the documentation here:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
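One detail worth adding, since the original error came from writing: both writers let you be explicit about the output encoding. A minimal sketch (the file name and parameter values are just one sensible choice):

# Write the CSV with an explicit encoding instead of the platform default:
data.to_csv('output.csv', index=False, encoding='utf-8')

# Keep accented characters literal in the JSON output:
json_str = data.to_json(orient='records', force_ascii=False)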
Have you tried using the codecs module?:
import codecs
....
codecs.EncodedFile(r, 'latin1').reader.read()
I remember having a similar issue a while back and the answer was something to do with how encoding was done prior to Python 3. Codecs seems to handle this problem relatively elegantly.
As coder mentioned in the question comments, it's difficult to pinpoint the problem without being able to reproduce it so I may be barking up the wrong tree.
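Along the same lines, in Python 3 the built-in open() takes the encoding directly, which makes the CSV-to-JSON step short. A minimal sketch, assuming the file really is latin-1 (the output name is made up):

import csv
import json

# Decode on the way in with the encoding the file actually uses...
with open('file.csv', encoding='latin-1', newline='') as f:
    rows = list(csv.reader(f))

# ...and encode on the way out as utf-8, keeping Spanish characters literal.
with open('file.json', 'w', encoding='utf-8') as f:
    json.dump(rows, f, ensure_ascii=False)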
You probably have two problems. But let's back off... We can't tell whether the text was imported incorrectly, exported incorrectly, or merely displayed in a goofy way.
First, I am going to discuss "importing"...
Do not try to alter the encoding. Instead live with the encoding. But first, figure out what the encoding is. It could be latin1 or it could be utf8. (Or any of lots of less likely charsets.)
Find out the hex for the incoming file. In Python, the code is something like this for dumping hex (etc) for string u:
import unicodedata

for i, c in enumerate(u):
    # Print each code point's index, hex value, Unicode category, and name.
    print i, '%04x' % ord(c), unicodedata.category(c),
    print unicodedata.name(c)
You can go here to see a list of hex values for all the latin1 characters, together with the utf8 hex. For example, ó is latin1 F3 or utf8 C3B3.
Now, armed with knowing the encoding, tell MySQL that.
LOAD DATA INFILE ...
...
CHARACTER SET utf8 -- or latin1
...;
Meanwhile, it does not matter what CHARACTER SET ... the table or column is defined to be; MySQL will transcode if necessary. All Spanish characters are available in latin1 and utf8.
Go to this Q&A.
I suggested that you have two errors: one is the "black diamond" case mentioned there; the other is something else. But... follow the "Best Practice" mentioned there.
Back to your question of "exporting"...
Again, you need to check the hex of the output file. Again it does not matter whether it is latin1 or utf8. However... If the hex is C383C2B3 for simply ó, you have "double encoding". If you have that, check to see that you have removed any manual conversion function calls, and simply told MySQL what's what.
Here are some more utf8+Python tips you might need.
If you need more help, follow the text step-by-step. Show us the code used to move/convert it at each step, and show us the HEX at each step.
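If it helps, here is a quick Python 3 way to eyeball that hex (the file name is assumed):

# Dump the first bytes of the exported file as hex; look for C3B3
# (proper utf8 for ó) versus C383C2B3 (the double-encoded form).
with open('file.csv', 'rb') as f:
    data = f.read(200)
print(' '.join('%02X' % b for b in data))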
I'm using python3 to do some web scraping. I want to save a webpage and convert it to text using the following code:
import urllib.request
import html2text
url='http://www.google.com'
page = urllib.request.urlopen(url)
html_content = page.read()
rendered_content = html2text.html2text(html_content)
But when I run the code, it reports a type error:
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/html2text-2016.4.2-py3.4.egg/html2text/__init__.py", line 127, in feed
data = data.replace("</' + 'script>", "</ignore>")
TypeError: 'str' does not support the buffer interface
Could anyone tell me how to deal with this error? Thank you in advance!
I took the time to investigate this, and it turns out to be easily resolved.
Why You Got This Error
The problem is one of bad input: when you called page.read(), a byte string was returned, rather than a regular string.
Byte strings are Python's way of handling data in an unknown character encoding: the raw bytes may not map cleanly to text until you say how they are encoded.
Because Python doesn't know what encoding to use, it represents such data as raw bytes - this is how all data is represented internally anyway - and lets the programmer decide what encoding to apply.
Mixing regular strings into operations on these byte strings - which is what html2text's internal replace() call did - fails, because Python 3 refuses to convert between bytes and text implicitly.
Solution
html_content = page.read().decode('iso-8859-1')
Padraic Cunningham's solution in the comments is correct in essence: you first have to tell Python which character encoding to use to map these bytes to the correct characters.
Unfortunately, this particular page isn't encoded as UTF-8, so decoding it with the UTF-8 codec throws an error.
The correct encoding to use is contained in the response headers themselves, under the Content-Type header - a standard header that well-behaved HTTP servers provide.
Simply calling page.info().get_content_charset() returns the value of this header, which in this case is iso-8859-1. From there, you can decode it correctly using iso-8859-1, so that regular tools can operate on it normally.
A More Generic Solution
charset_encoding = page.info().get_content_charset()
html_content = page.read().decode(charset_encoding)
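Putting it together, with a fallback for servers that omit the charset (the fallback choice of utf-8 is my assumption):

import urllib.request
import html2text

with urllib.request.urlopen('http://www.google.com') as page:
    # Fall back to utf-8 when no charset is declared in Content-Type.
    charset_encoding = page.info().get_content_charset() or 'utf-8'
    html_content = page.read().decode(charset_encoding, errors='replace')

rendered_content = html2text.html2text(html_content)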
The stream returned by urlopen is indicated as being a bytestream by the b prefix before the quoted string in its printed form. If you exclude it, as in the appended code, it seems to work as input for html2text.
import urllib.request
import html2text
url='http://www.google.com'
with urllib.request.urlopen(url) as page:
    html_content = page.read()
    charset_encoding = page.info().get_content_charset()

rendered_content = html2text.html2text(str(html_content)[1:], charset_encoding)
Revised using suggestions about encoding. Yes, it's a hack, but it runs. Not using str() means the original TypeError problem remains.
I have a script thats looping through a database and doing some beautifulsoup processing on the string along with replacing some text with other text, etc.
This works most of the time, however some HTML blobs seem to contain unicode text which breaks the script with the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 112: ordinal not in range(128)
I'm not sure what to do in this case, does anyone know of a module / function to force all text in the string to be a standardized utf-8 or something?
All the html blobs in the database came from feedparser (downloading rss feeds, storing in db).
Before you do any further processing with your string variable:
clean_str = unicode(str_var_with_strange_coding, errors='ignore')
The messed-up characters are skipped. Not elegant, since you make no attempt to recover possibly meaningful values, but effective.
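For instance, under Python 2, which this thread is using (the sample bytes are made up):

# A latin-1 byte (0xe9) that is not valid ASCII:
str_var_with_strange_coding = 'caf\xe9 society'
clean_str = unicode(str_var_with_strange_coding, errors='ignore')
print repr(clean_str)  # u'caf society' -- the offending byte is dropped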
Since you don't want to show us your code, I'm going to give a general answer that hopefully helps you find the problem.
When you first get the data out of the database and fetch it with fetchone, you need to convert it into a unicode object. It is good practice to do this as soon as you have your variable, and then re-encode it only when you output it.
db = MySQLdb.connect()
cur = db.cursor()
cur.execute("SELECT col FROM the_table LIMIT 10")
xml = cur.fetchone()[0].decode('utf-8') # Or whatever encoding the text is in, though we're pretty sure it's utf-8. You might use chardet
After you run xml through BeautifulSoup, you might encode the string again if it is being saved into a file or you might just leave it as a Unicode object if you are re-inserting it into the database.
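And on the way out, a one-line sketch of that encode-late step (the file name is made up):

# Encode only at the output boundary (Python 2 style, matching the above):
with open('output.xml', 'w') as f:
    f.write(xml.encode('utf-8'))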
Make sure you really understand the difference between unicode and UTF-8: they are not the same thing (which surprises many people). Start with The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
What is the encoding of your DB? Is it really UTF-8, or do you only assume that it is? If it contains blobs with random encodings, then you have a problem, because you cannot guess the encoding. When you read from the database, decode the blob to unicode and use unicode from then on in your code.
But let's assume your database is UTF-8. Then you should use unicode everywhere: decode early, encode late. Use unicode everywhere inside your program, and only decode/encode when you read from or write to the database, display, write to a file, etc.
Unicode and encodings are a bit of a pain in Python 2.x; fortunately, in Python 3 all text is unicode.
Regarding BeautifulSoup, use the latest version 4.
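With version 4 the decoding largely happens for you; a minimal sketch, where html_blob is a hypothetical stand-in for one row's HTML and the parser choice is mine:

from bs4 import BeautifulSoup

# BeautifulSoup 4 converts the document to unicode internally,
# using the declared or detected encoding.
soup = BeautifulSoup(html_blob, 'html.parser')
text = soup.get_text()  # already a unicode string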
Well, after a couple more hours of googling, I finally came across a solution that eliminated all decode errors. I'm still fairly new to Python (heavy PHP background) and didn't understand character encoding.
In my code I had a .decode('utf-8') and after that did some .replace(str(beatiful_soup_tag),'') statements. The solution ended up being as simple as changing all str() calls to unicode(). After that, not a single issue.
Answer found on:
http://ubuntuforums.org/showthread.php?t=1212933
I sincerely apologize to the commenters who requested I post the code, what I thought was rock solid and not the issue was quite the opposite and I'm sure they would have caught the issue right away! I'll not make that mistake again! :)
I'm parsing a JSON feed in Python and it contains this character, causing it not to validate.
Is there a way to handle these symbols? Can they be converted, or is there a tidy way to remove them?
I don't even know what this symbol is called or what causes them, otherwise I would research it myself.
EDIT: Stack Overflow is stripping the character, so here:
http://files.getdropbox.com/u/194177/symbol.jpg
It's that [?] symbol in "Classic 80s"
That probably means the text you have is in some sort of encoding, and you need to figure out what encoding, and convert it to Unicode with a thetext.decode('encoding') call.
I'm not sure, but it could possibly be the [?] character, meaning that the display you have also doesn't know how to show it. That would probably mean the data is incorrect: there is a character in it that doesn't exist in the encoding you are supposed to use. To handle that, call decode like this: thetext.decode('encoding', 'ignore'). There are other options than 'ignore', such as 'replace'; handlers like 'xmlcharrefreplace' exist too, though that one applies when encoding rather than decoding.
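A quick sketch of those options under Python 2; the encoding and sample bytes are assumptions:

thetext = '\x80Classic 80s'  # a byte that is invalid as UTF-8
print repr(thetext.decode('utf-8', 'ignore'))   # u'Classic 80s' -- bad byte dropped
print repr(thetext.decode('utf-8', 'replace'))  # u'\ufffdClassic 80s' -- U+FFFD substituted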
JSON must be encoded in one of UTF-8, UTF-16, or UTF-32. If a JSON file contains bytes which are illegal in its current encoding, it is garbage.
If you don't know which encoding it's using, you can try parsing using my jsonlib library, which includes an encoding-detector. JSON parsed using jsonlib will be provided to the programmer as Unicode strings, so you don't have to worry about encoding at all.