Cannot parse through a file with latin-1 encoding - python

I'm trying to parse a large file of tweets from the Stanford Sentiment Database (see here: http://help.sentiment140.com/for-students/), with the following being my code:
def init_process(fin, fout):
outfile = open(fout, 'a')
with open(fin, buffering=200000, encoding='latin-1') as f:
try:
for line in f:
line = line.replace('"', '')
initial_polarity = line.split(',')[0]
if initial_polarity == '0':
initial_polarity = [1, 0]
elif initial_polarity == '4':
initial_polarity = [0, 1]
tweet = line.split(',')[-1]
outline = str(initial_polarity) + ':::' + tweet
outfile.write(outline)
except Exception as e:
print(str(e))
outfile.close()
init_process('training.1600000.processed.noemoticon.csv','train_set.csv')
I've run into this following issue:
'ascii' codec can't encode characters in position 12-14: ordinal not in range(128)
which doesn't make sense since I'm opening the file with a latin-1 encoding. How do I stop this error and successfully parse through the file?

It's probably the outfile encoding that's still ASCII. You should open it with the proper encoding, too (doesn't have to be latin-1, probably utf-8 is more appropriate depending on your environment).
Per comment from Åsmund: the file encoding is locale-specific, you should probably consider changing your locale to something that can handle non-ASCII text.

Related

How to this error ? utf-8' codec can't decode byte 0xef in position 32887: invalid continuation byte

enter image description here
Hello. I am trying to open this file which is in .txt format but it gives me an error.
Sometimes when you don't have uniform files you have to by specific with the correct encoding,
You should indicate it in function open for example,
with open(‘file.txt’, encoding = ‘utf-8’) as f:
etc
also you can detect the file encoding like this:
from chardet import detect
with open(file, 'rb') as f:
rawdata = f.read()
enc = detect(rawdata)['encoding']
with open(‘file.txt’, encoding = enc) as f:
etc
Result:
>>> from chardet import detect
>>>
>>> with open('test.c', 'rb') as f:
... rawdata = f.read()
... enc = detect(rawdata)['encoding']
...
>>> print(enc)
ascii
Python 3.7.0

How to translate encoding by ansi into unicode

When I use the CountVectorizer in sklearn, it needs the file encoding in unicode, but my data file is encoding in ansi.
I tried to change the encoding to unicode using notepad++, then I use readlines, it cannot read all the lines, instead it can only read the last line. After that, I tried to read the line into data file, and write them into the new file by using unicode, but I failed.
def merge_file():
root_dir="d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
resname='resule_final.txt'
if os.path.exists(resname):
os.remove(resname)
result = codecs.open(resname,'w','utf-8')
num = 1
for back_name in os.listdir(r'd:\\workspace\\minibatchk-means\\data\\20_newsgroups'):
current_dir = root_dir + str(back_name)
for filename in os.listdir(current_dir):
print num ,":" ,str(filename)
num = num+1
path=current_dir + "\\" +str(filename)
source=open(path,'r')
line = source.readline()
line = line.strip('\n')
line = line.strip('\r')
while line !="":
line = unicode(line,"gbk")
line = line.replace('\n',' ')
line = line.replace('\r',' ')
result.write(line + ' ')
line = source.readline()
else:
print 'End file :'+ str(filename)
result.write('\n')
source.close()
print 'End All.'
result.close()
The error message is :UnicodeDecodeError: 'gbk' codec can't decode bytes in position 0-1: illegal multibyte sequence
Oh,I find the way.
First, use chardet to detect string encoding.
Second,use codecs to input or output to the file in the specific encoding.
Here is the code.
import chardet
import codecs
import os
root_dir="d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
num = 1
failed = []
for back_name in os.listdir("d:\\workspace\\minibatchk-means\\data\\20_newsgroups"):
current_dir = root_dir + str(back_name)
for filename in os.listdir(current_dir):
print num,":",str(filename)
num=num+1
path=current_dir+"\\"+str(filename)
content = open(path,'r').read()
source_encoding=chardet.detect(content)['encoding']
if source_encoding == None:
print '??' , filename
failed.append(filename)
elif source_encoding != 'utf-8':
content=content.decode(source_encoding,'ignore')
codecs.open(path,'w',encoding='utf-8').write(content)
print failed
Thanks for all your help.

Text mining UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1671718: character maps to <undefined>

I have written code to create frequency table. but it is breaking at the line ext_string = document_text.read().lower(. I even put a try and except to catch the error but it is not helping.
import re
import string
frequency = {}
file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
for word in match_pattern:
try:
count = frequency.get(word,0)
frequency[word] = count + 1
except UnicodeDecodeError:
pass
frequency_list = frequency.keys()
for words in frequency_list:
print (words, frequency[words])
You are opening your file twice, the second time without specifying the encoding:
file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')
You should open the file as follows:
frequencies = {}
with open('EVG_text mining.txt', encoding="utf8", mode='r') as f:
text = f.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)
...
The second time you were opening your file, you were not defining what encoding to use which is probably why it errored.
The with statement helps perform certain task linked with I/O for a file. You can read more about it here: https://www.pythonforbeginners.com/files/with-statement-in-python
You should probably have a look at error handling as well as you were not enclosing the line that was actually causing the error: https://www.pythonforbeginners.com/error-handling/
The code ignoring all decoding issues:
import re
import string # Do you need this?
with open('EVG_text mining.txt', mode='rb') as f: # The 'b' in mode changes the open() function to read out bytes.
bytes = f.read()
text = bytes.decode('utf-8', 'ignore') # Change 'ignore' to 'replace' to insert a '?' whenever it finds an unknown byte.
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)
frequencies = {}
for word in match_pattern: # Your error handling wasn't doing anything here as the error didn't occur here but when reading the file.
count = frequencies.setdefault(word, 0)
frequencies[word] = count + 1
for word, freq in frequencies.items():
print (word, freq)
To read a file with some special characters, use encoding as 'latin1' or 'unicode_escape'

Python 2.7 ascii' codec can't encode character u'\xe4

I have experienced a code problem in Python 2.7, I already used UTF-8, but it still got the exception
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 81: ordinal not in range(128)"
My files and contains so many this kind of shit, but for some reason, I'm not allowed to delete it.
desktop,[Search] Store | Automated Titles,google / cpc,Titles > Kesäkaverit,275285048,13
I have tried the below method to avoid, but still, haven't fix it. Can anyone help me ?
1.With "#!/usr/bin/python" in my file header
2.Set setdefaultencoding
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
3.content = unicode(s3core.download_file_to_memory(S3_PROFILE, S3_RAW + file), "utf-8", "ignore")
My code below
content = unicode(s3core.download_file_to_memory(S3_PROFILE, S3_RAW + file), "utf8", "ignore")
rows = content.split('\n')[1:]
for row in rows:
if not row:
continue
try:
# fetch variables
cols = row.rstrip('\n').split(',')
transaction = cols[0]
device_category = cols[1]
campaign = cols[2]
source = cols[3].split('/')[0].strip()
medium = cols[3].split('/')[1].strip()
ad_group = cols[4]
transactions = cols[5]
data_list.append('\t'.join(
['-'.join([dt[:4], dt[4:6], dt[6:]]), country, transaction, device_category, campaign, source,
medium, ad_group, transactions]))
except:
print 'ignoring row: ' + row

python : unicodeEncodeError: 'charpmap' codec can't encode character '\u2026'

I try to analyse some tweets I got from tweeter, but It seems I have a probleme of encoding, if you have any idea..
import json
#Next we will read the data in into an array that we call tweets.
tweets_data_path = 'C:/Python34/TESTS/twitter_data.txt'
tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
try:
tweet = json.loads(line)
tweets_data.append(tweet)
except:
continue
print(len(tweets_data))#412 tweets
print(tweet)
I got the mistake :
File "C:\Python34\lib\encodings\cp850.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_map)[0]
unicodeEncodeError: 'charpmap' codec can't encode character '\u2026' in position 1345: character maps to undefined
At work, I didn't get the error, but I have python 3.3, does it make a difference, do you think ?
-----EDIT
The comment from #MarkRamson answered my question
You should specify the encoding when opening the file:
tweets_file = open(tweets_data_path, "r", encoding="utf-8-sig")

Categories

Resources