Encoding Error While Merging Multiple Text Files in Python

I want to merge multiple files using Python 3. All the files are in a single folder and have a .txt extension. Some of the file names start with special characters such as a dot (.) or braces (). The code and the dataset live in separate folders.
What I have tried is as follows:
#encoding:utf-8
import os
from pprint import pprint

path = 'F:/abc/intt'
file_names = os.listdir(path)
with open('F://hi//Datasets//megef_fil.txt', 'w', encoding='utf-8') as outfile:
    for fname in file_names:
        with open(os.path.join(path, fname)) as infile:
            for line in infile:
                outfile.write(line)
The traceback I am getting looks like this:
  File "C:\Users\Shanky\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 31: character maps to <undefined>
I have absolutely no clue where I am going wrong. Please help me fix this issue; any help is appreciated.

It seems that your input file is being read with the wrong encoding (the traceback shows Python fell back to cp1252, the Windows default). Try opening it with the encoding given explicitly:
with open(os.path.join(path,fname), mode="r", encoding="utf-8") as infile:
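For reference, here is a minimal sketch of the full merge loop with explicit encodings on both sides. It assumes the input files really are UTF-8; files that still fail to decode are skipped with a warning instead of aborting the whole merge:
import os

path = 'F:/abc/intt'
out_path = 'F://hi//Datasets//megef_fil.txt'

with open(out_path, 'w', encoding='utf-8') as outfile:
    for fname in os.listdir(path):
        if not fname.endswith('.txt'):  # skip non-text entries in the folder
            continue
        try:
            with open(os.path.join(path, fname), encoding='utf-8') as infile:
                for line in infile:
                    outfile.write(line)
        except UnicodeDecodeError as exc:
            print('skipping', fname, '-', exc)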

Related

Script crashes when trying to read a specific Japanese character from file

I was trying to save some Japanese characters from a text file into a string. Most characters, like "道", cause no problems, but others, like "坂", don't work; when I try to read them, my script crashes.
Do I have to use a specific encoding while reading the file?
That's my code btw:
with open(path, 'r') as file:
    lines = [line.rstrip() for line in file]
The error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 310: character maps to <undefined>
You have to specify the encoding when working with non-ASCII text, like this:
file = open(filename, encoding="utf8")
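Applied to the snippet above, that becomes the following small sketch. It assumes the file is UTF-8; Japanese text files can also be Shift_JIS or EUC-JP, in which case pass that codec name instead:
# Assuming UTF-8; swap in 'shift_jis' or 'euc_jp' if the file
# came from a Japanese Windows environment.
with open(path, 'r', encoding='utf8') as file:
    lines = [line.rstrip() for line in file]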

codec can't decode byte 0x81

I have this simple bit of code:
file = open(filename, "r", encoding="utf-8")
num_lines = sum(1 for line in open(filename))
I simply want to get the number of lines in the file, but I keep getting this error. I'm thinking of just skipping Python and doing it in C# ;-)
Can anyone help? I added 'utf-8' after searching for the error and reading that it should fix it. The file is just a simple text file, not an image, albeit a large one. It's actually CSV data, but I just want an idea of the number of lines before I start processing it.
Many thanks.
  ... in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4344: character maps to <undefined>
It seems to be an encoding problem.
In your example code, you are opening the file twice, and the second open doesn't include the encoding.
Try the following code:
file = open(filename, "r", encoding="utf-8")
num_lines = sum(1 for line in file)
Or, better, using a context manager so the file is closed again:
with open(filename, "r", encoding="utf-8") as file:
    num_lines = sum(1 for line in file)
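If the goal is only a line count and the encoding is uncertain, the decode step can be sidestepped entirely by counting newlines in binary mode. A small sketch of that alternative (not from the original answer):
# Counts b'\n' without decoding, so a stray byte like 0x81 can never
# raise; reads in 1 MiB chunks to keep memory flat on large files.
def count_lines(filename):
    count = 0
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            count += chunk.count(b'\n')
    return count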

Not able to read file due to unicode error in python

I'm trying to read a file, and when I do, I get a Unicode error.
def reading_File(self, text):
    url_text = "Text1.txt"
    with open(url_text) as f:
        content = f.read()
Error:
    content = f.read()  # Read the whole file
  File "/home/soft/anaconda/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 404: ordinal not in range(128)
Why is this happening? I'm running the same code on a Linux system; on Windows it runs properly.
According to the question:
"I'm trying to run the same on a Linux system, but on Windows it runs properly."
Since we know from the question and some of the other answers that the file's contents are neither ASCII nor UTF-8, it's a reasonable guess that the file is encoded with one of the 8-bit encodings common on Windows.
As it happens 0x92 maps to the character 'RIGHT SINGLE QUOTATION MARK' in the cp125* encodings, used on US and latin/European regions.
So the file should probably be opened like this:
# Python 3
with open(url_text, encoding='cp1252') as f:
    content = f.read()

# Python 2
import codecs
with codecs.open(url_text, encoding='cp1252') as f:
    content = f.read()
There can be two reasons for that to happen:
The file contains text encoded with an encoding different from 'ascii' and, according to your comments on other answers, 'utf-8'.
The file doesn't contain text at all, it is binary data.
In case 1 you need to figure out how the text was encoded and use that encoding to open the file:
open(url_text, encoding=your_encoding)
In case 2 you need to open the file in binary mode:
open(url_text, 'rb')
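If you have no idea how the text was encoded, one option is to guess it with the third-party chardet package (pip install chardet). A minimal sketch of that approach, not from the original answer:
# Read the raw bytes, let chardet guess the codec, then decode.
import chardet

with open(url_text, 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
content = raw.decode(guess['encoding'] or 'utf-8', errors='replace')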
By the looks of it, your default encoding is ASCII, while on Python 3 it should be UTF-8. The file can be opened with the encoding given explicitly:
open(file, encoding='utf-8')
Check your system default encoding,
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
If it's not UTF-8, reset your system's locale:
export LANGUAGE=en_US.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
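As a quick check of what open() will actually use when no encoding is passed (not part of the original answer):
import locale, sys

print(sys.getdefaultencoding())       # codec for str/bytes conversions, 'utf-8' on Python 3
print(locale.getpreferredencoding())  # codec open() falls back to when none is given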
You can use codecs.open to fix this issue with the correct encoding:
import codecs
with codecs.open(filename, 'r', 'utf8') as ff:
    content = ff.read()

decoding error while reading csv files from folder

I am reading headers of csv files from a folder.
code:
# mypath = folder directory with the csv files
for each_file in listdir(mypath):
    with open(mypath + "//" + each_file) as f:
        first_line = f.readline().strip().split(",")
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
Environment:
Spyder, Python 3
I don't understand the encoding error, since I have not done any encoding myself.
Try passing an encoding when opening the file in the with statement. The code below worked fine for me; if utf-8 doesn't work, try different encodings and see if any of them does (see the sketch after this code):
for each_file in listdir(path):
    with open(path + "//" + each_file, encoding='utf-8') as f:
        first_line = f.readline().strip().split(",")
        print(each_file, ' --> ', first_line)
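A minimal sketch of that trial-and-error approach, with a hypothetical list of candidate codecs (adjust it to what your data is likely to be):
# Try each candidate encoding until one decodes the header line.
# 'latin-1' accepts any byte sequence, so it acts as a last resort.
CANDIDATES = ['utf-8', 'utf-8-sig', 'cp1252', 'latin-1']

def read_header(file_path):
    for enc in CANDIDATES:
        try:
            with open(file_path, encoding=enc) as f:
                return f.readline().strip().split(",")
        except UnicodeDecodeError:
            continue
    raise ValueError('could not decode %s with %s' % (file_path, CANDIDATES))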
Also, check this link on how to check a CSV file's encoding; hope it helps.
How to check encoding of CSV file
Happy Coding :)
Try using a single slash '/':
with open(mypath +"/"+each_file) as f:
Another possibility is that the CSV file is in a different Unicode encoding (such as UTF-16) rather than UTF-8. It would be easier to help if you posted a sample of the CSV file too.
The built-in os.path.join provides a convenient way to join two or more paths without worrying about platform-specific slashes '/' or '\'.
import os

files = os.listdir(path)
for file in files:
    with open(os.path.join(path, file), encoding='utf-8') as f:
        first_line = f.readline().strip().split(",")
        print(file, ' --> ', first_line)

Python CSV file UTF-16 to UTF-8 print error

There are a number of topics on this problem around the web, but I can not seem to find the answer for my specific case.
I have a CSV file. I am not sure what was done to it, but when I try to open it, I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Here is a full Traceback:
Traceback (most recent call last):
  File "keywords.py", line 31, in <module>
    main()
  File "keywords.py", line 28, in main
    get_csv(file_full_path)
  File "keywords.py", line 19, in get_csv
    for row in reader:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)
With the help of Stack Overflow, I got it open with:
reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
Now the problem is that when I am reading the file:
def get_csv(file_full_path):
    import csv, codecs
    reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
    for row in reader:
        print row
I get stuck on Asian symbols:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)
I have tried decode, 'encode', and unicode() on the string containing that character, but it does not seem to help.
for row in reader:
    #decoded_row = [element_s.decode('UTF-8') for element_s in row]
    #print decoded_row
    encoded_row = [element_s.encode('UTF-8') for element_s in row]
    print encoded_row
At this point I do not really understand why. If I
>>> print u'\u5a07'
娇
or
>>> print '娇'
娇
it works. In the terminal it also works. I have checked the default encoding in the terminal and the Python shell; it is UTF-8 everywhere, and both print that symbol easily. I assume it has something to do with opening the file through codecs using UTF-16.
I am not sure where to go from here. Could anyone help out?
The csv module can not handle Unicode input. It says so specifically on its documentation page:
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe;
You need to convert your CSV file to UTF-8 so that the module can deal with it:
with codecs.open(file_full_path, 'rU', 'UTF-16') as infile:
    with open(file_full_path + '.utf8', 'wb') as outfile:
        for line in infile:
            outfile.write(line.encode('utf8'))
Alternatively, you can use the command-line utility iconv to convert the file for you.
Then use that re-coded file to read your data:
reader = csv.reader(open(file_full_path + '.utf8', 'rb'), delimiter='\t', quotechar='"')
for row in reader:
    print [c.decode('utf8') for c in row]
Note that the columns then need decoding to unicode manually.
Encode errors are what you get when you try to convert unicode characters to 8-bit sequences, so your first error is not raised while actually reading the file, but a bit later.
You probably get it because the Python 2 csv module expects its input file to be opened in binary mode, while you opened yours through codecs, which returns unicode strings.
Change your opening to this:
reader = csv.reader(open(file_full_path, 'rb'), delimiter='\t', quotechar='"')
And you should be fine. Or even better:
with open(file_full_path, 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t', quotechar='"')
    # CSV handling here.
However, you can't use UTF-16 (or UTF-32), as the separation characters are two-byte characters in UTF-16, and it will not handle this correctly, so you will need to convert it to UTF-8 first.
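For completeness: on Python 3 the csv module accepts text streams directly, so no re-encoding step is needed. A minimal sketch, assuming the file really is UTF-16 and tab-delimited as above:
import csv

# Python 3: open the file as text with the UTF-16 codec (the BOM is
# consumed automatically); newline='' is what the csv docs recommend
# when handing a file object to csv.reader.
with open(file_full_path, newline='', encoding='utf-16') as f:
    reader = csv.reader(f, delimiter='\t', quotechar='"')
    for row in reader:
        print(row)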
