chdir modifying the path in Python

I've got a program that reads strings with special characters (used in Spanish) from a file. I then use chdir to change to a directory whose name is that string.
For example, in a file called "names.txt" I have the following:
Tableta
Música
.
.
etc
That file is encoded in UTF-8, so I read it from Python as follows:
f=open("names.txt","r",encoding="utf-8")
names=f.readlines()
f.close()
And it does read everything successfully:
print(names)
output:
['Tableta\n','Música\n', ...etc]
The problem arises when I want to change to the first directory (the first name, 'Tableta', without the newline character):
chdir(names[0][:-1])
I get the following error
FileNotFoundError: [WinError 2] The system cannot find the file specified: "\ufeffTableta"
And it only happens with the first name, which was very odd to me. With other names it is able to change directories, whether they have special characters or not.
I supposed it had something to do with the encoding, because of that extra '\ufeff' character. So I changed the "names.txt" file to ANSI encoding and deleted all the special characters so that I could read it with Python, and it worked. But the thing is that I need that file in UTF-8 encoding so that I can read special characters. Is there any way I can fix this? Why is Python adding that '\ufeff' character to the string, and only to the first name?

Your file "names.txt" has a Byte Order Mark (BOM). To remove it, open the file with the following codec:
f = open("names.txt", encoding="utf-8-sig")
As a side note, it is safer to strip a file name: names[0].strip() instead of names[0][:-1].
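Putting the two hints together, a minimal sketch of the fixed read (file and directory names taken from the question):
import os

# utf-8-sig drops the BOM if one is present, and strip() drops the trailing newline.
with open("names.txt", encoding="utf-8-sig") as f:
    names = [line.strip() for line in f]

os.chdir(names[0])  # now changes into the "Tableta" directory without error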

The beginning of your file has the Unicode BOM. Skip the first character when reading your file, or open it with the utf-8-sig encoding.

How to filter Unicode characters [duplicate]

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
File "SCRIPT LOCATION", line NUMBER, in <module>
text = file.read()
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
As an extension to @LennartRegebro's answer:
If you can't tell what encoding your file uses, the solution above does not work (it's not UTF-8), and you find yourself merely guessing, there are online tools you can use to identify the encoding. They aren't perfect, but they usually work just fine. After you figure out the encoding, you should be able to use the solution above.
EDIT: (Copied from comment)
The quite popular text editor Sublime Text has a command to display the encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type into field at the bottom view.encoding() and hope for the best (I was unable to get anything but Undefined but maybe you will have better luck...)
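If you would rather guess programmatically instead of using an editor or an online tool, the third-party chardet module (also mentioned in a later answer) can take a rough guess. A minimal sketch, with "datafile.txt" as a placeholder name:
import chardet  # third-party: pip install chardet

# Read the raw bytes and let chardet guess the encoding.
with open("datafile.txt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
text = raw.decode(guess["encoding"] or "latin-1", errors="replace")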
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same code page as the current environment (cp1252 in the case of the opening post) and tries to decode it accordingly. If the file contains characters with values not defined in this code page (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like, e.g., cp790), and sometimes the file can contain mixed encodings.
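To see what that default actually resolves to on a given machine, a quick standard-library check (this is what open() falls back to when no encoding is passed):
import locale

# On many Western-locale Windows installs this prints 'cp1252',
# which is how the opening post ran into the error.
print(locale.getpreferredencoding(False))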
If such characters are unneeded, one may decide to replace them with the official replacement character (�), with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The offending characters are then simply dropped, but other errors will be masked too.
A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which has ALL characters defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
Stop wasting your time: just add encoding="cp437" and errors='ignore' to your code, in both read and write:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, encoding with utf16 worked:
file = open('filename.csv', encoding="utf16")
For those working in Anaconda on Windows, I had the same problem. Notepad++ helped me solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". In "Encoding" go to "Character sets" and there, with patience, look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check what Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090), and then consider removing it from the file.
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
OR (AND)
def write_files(text, file_path):
    # open in binary write mode ('wb', not 'rb') so the encoded bytes can be written
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations (in the Configuration tab, change the value of the Interpreter options field to -Xutf8).
Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
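A quick way to confirm that UTF-8 mode is actually active (the command lines are shown as comments; your_script.py is a placeholder):
# python -Xutf8 your_script.py      (interpreter option)
# set PYTHONUTF8=1   (Windows cmd)   /   export PYTHONUTF8=1   (bash)
import sys
print(sys.flags.utf8_mode)  # prints 1 when UTF-8 mode is enabled, 0 otherwise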
For me, changing the MySQL character encoding to match my code helped to sort out the solution: photo = open('pic3.png', encoding='latin1')

Open multiple files from a file

I need to open a set of files whose absolute paths are listed in another file.
EX:
Layer 1 = C:\User\Files\Menu\Menu.snt
Layer 2 = C:\User\Files\N0 - Vertical.snt
The problem is that when I try to open C:\User\Files\Menu\Menu.snt, Python doesn't like \U or \N.
I could open using r"C:\User\Files\Menu\Menu.snt" but I can't automate this process.
file = open("config.txt", "r").read()
list = []
for line in file.split("\n"):
    list.append(open(line.split("=", 1)[1]).read())
It prints out:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 33-34: malformed \N character escape
The backslash character \ is used as an escape character by the Python interpreter in order to provide special characters.
For example, \n is a "New Line" character, like you would get from pressing the Return key on your keyboard.
So if you are trying to read something like newFolder1\newFolder2, the interpreter reads it as:
newFolder1
ewFolder2
where the New Line character has been inserted between the two lines of text.
You already mentioned one workaround: using raw strings like r'my\folder\structure' and I'm a little curious why this can't be automated.
If you can automate it, you could try replacing all instances of a single backslash (\) with a double backslash (\\) in your file paths and that should work.
Alternatively, you can try looking in the os module and dynamically building your paths using os.path.join(), along with os.sep.
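For example, a small sketch building the first path from the question out of its parts, so that no backslash escaping is needed beyond the drive root:
import os

# The folder names come from the question; on Windows this prints C:\User\Files\Menu\Menu.snt
menu_path = os.path.join("C:\\", "User", "Files", "Menu", "Menu.snt")
print(menu_path)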
One final point: You can save yourself some effort by replacing:
list.append(open(line.split("=",1)[1]).read())
by
list = open(line.split("=",1)[1]).readlines()
here is my solution:
file = open("config.txt", "r").readlines()
list = [open(x.split("=")[1].strip(), 'r').read() for x in file]
readlines creates a list that contains all lines in the file, so there is no need to split the whole string.

How to check each line of file for UTF-8 and write in another file?

I would like to know how I can write, on the fly, the lines which are UTF-8 encoded to another file. I have a folder containing a number of files. I cannot go and check each and every file for UTF-8 characters.
I have tried this code:
import codecs
try:
    f = codecs.open(filename, encoding='utf-8', errors='strict')
    for line in f:
        pass
    print("Valid utf-8")
except UnicodeDecodeError:
    print("invalid utf-8")
This checks whether the whole file is valid UTF-8 or not. But I am trying to check each and every line of every file in a folder and write out those lines which are UTF-8 encoded.
I would like to delete the lines in my file which are not UTF-8 encoded. If, while reading a line, the program finds that the line is UTF-8, it should move on to the next line; otherwise it should delete the line which is not UTF-8. I think it is clear now.
I would like to know how I can do it with the help of Python. Kindly let me know.
I am not looking to convert them, but to delete them. Or to write the lines that satisfy UTF-8 from the files to another file.
This article will be of help with how to process text files in Python 3.
Basically, if you use:
open(fname, encoding="utf-8", errors="strict")
it will raise an exception if the file is not UTF-8 encoded, but you can change the error-handling parameter to read the file anyway and apply your own algorithm to exclude lines.
For example:
open(fname, encoding="utf-8", errors="replace")
will replace non-UTF-8 bytes with the � replacement character.
As @Leon says, you need to consider that Chinese and/or Arabic characters can be valid UTF-8.
If you want a more strict character set, you can try to open your file using the latin-1 or ascii encoding (taking into account that UTF-8 and latin-1 are ASCII compatible).
You need to take into account that there are many character encodings, and they may not be ASCII compatible. It is very difficult to read text files properly if you don't know their encoding; the chardet module can help with that, but it is not 100% reliable.
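Putting that together for the original question (keep only the lines that decode cleanly as UTF-8 and write them to another file), a minimal sketch with placeholder file names:
# Read in binary mode so decoding is under our control, then test each line.
with open("input.txt", "rb") as infile, open("clean.txt", "wb") as outfile:
    for raw_line in infile:
        try:
            raw_line.decode("utf-8")  # strict by default; raises on invalid bytes
        except UnicodeDecodeError:
            continue  # drop lines that are not valid UTF-8
        outfile.write(raw_line)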

Remove byte order mark from objects in a list

I am using Python (3.4, on Windows 7) to download a set of text files, and when I read (and write, after modifications) these files, they appear to have a few byte order marks (BOM) among the values that are retained, primarily the UTF-8 BOM. Eventually I use each text file as a list (or a string) and I cannot seem to remove these BOMs. So I ask whether it is possible to remove the BOM?
For more context, the text files were downloaded from a public ftp source where users upload their own documents, and thus the original encoding is highly variable and unknown to me. To allow the download to run without error, I specified encoding as UTF-8 (using latin-1 would give errors). So it's not a mystery to me that I have the BOM, and I don't think an up-front encoding/decoding solution is likely to be answer for me (Convert UTF-8 with BOM to UTF-8 with no BOM in Python) - it actually appears to make the frequency of other BOM increase.
When I modify the files after download, I use the following syntax:
with open(t, "w", encoding='utf-8') as outfile:
    with open(f, "r", encoding='utf-8') as infile:
        text = infile.read()
        # Arguments to make modifications follow
Later on, after the "outfiles" are read in as a list I see that some words have the UTF-8 BOM, like \ufeff. I try to remove the BOM using the following list comprehension:
g = list_outfile #Outfiles now stored as list
g = [i.replace(r'\ufeff','') for i in g]
While this code will run, unfortunately the BOMs remain when, for example, I print the list (I believe I would have a similar issue even if I tried to remove the BOM from strings and not lists: How to remove this special character?). If I put a normal (non-BOM) word in the list comprehension, that word will be replaced.
I do understand that if I print the list object by object the BOM will not appear (Special national characters won't .split() in Python). And the BOM is not in the raw text files. But I worry that those BOMs will remain when running later arguments for text analysis, and thus any object that appears in the list as \ufeffword rather than word will be analyzed as \ufeffword.
Again, is it possible to remove the BOM after the fact?
The problem is that you are replacing specific bytes, while the representation of your byte order mark might be different, depending on the encoding of your file.
Actually checking for the presence of a BOM is pretty straightforward with the codecs library. Codecs has the specific byte order marks for different UTF encodings. Also, you can get the encoding automatically from an opened file, no need to specify it.
Suppose you are reading a csv file with utf-8 encoding, which may or may not use a byte order mark. Then you could go about like this:
import codecs
with open("testfile.csv", "r") as csvfile:
    line = csvfile.readline()
    if codecs.BOM_UTF8.decode(csvfile.encoding) in line:
        # A Byte Order Mark is present
        line = line.strip(codecs.BOM_UTF8.decode(csvfile.encoding))
    print(line)
In the output resulting from the code above you will see the output without byte order mark. To further improve on this, you could also restrict this check to be only done on the first line of a file (because that is where the byte order mark always resides, it is the first few bytes of the file).
Using strip (instead of replace) does nothing at all if the indicated byte order mark is not present. So you may even skip the manual check for the byte order mark altogether and just run the strip method on the entire contents of the file:
import codecs
with open("testfile.csv", "r") as csvfile:
    with open("outfile.csv", "w") as outfile:
        outfile.write(csvfile.read().strip(codecs.BOM_UTF8.decode(csvfile.encoding)))
Voila, you end up with 'outfile.csv' containing the exact contents of the original (testfile.csv) without the Byte Order Mark.
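As a side note, if the files are known to be UTF-8 with an optional BOM, the utf-8-sig codec mentioned earlier handles this at read time; a sketch using the same file names:
# utf-8-sig decodes a leading BOM away when present and is a no-op otherwise.
with open("testfile.csv", "r", encoding="utf-8-sig") as csvfile:
    with open("outfile.csv", "w", encoding="utf-8") as outfile:
        outfile.write(csvfile.read())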

Removing unknown characters from a text file

I have a large number of files containing data I am trying to process using a Python script.
The files are in an unknown encoding, and if I open them in Notepad++ they contain numerical data separated by a load of 'null' characters (represented as NULL in white on black background in Notepad++).
In order to handle this, I separate the file by the null character \x00 and retrieve only numerical values using the following script:
stripped_data = []
for root, dirs, files in os.walk(PATH):
    for rawfile in files:
        (dirName, fileName) = os.path.split(rawfile)
        (fileBaseName, fileExtension) = os.path.splitext(fileName)
        h = open(os.path.join(root, rawfile), 'r')
        line = h.read()
        for raw_value in line.split('\x00'):
            try:
                test = float(raw_value)
                stripped_data.append(raw_value.strip())
            except ValueError:
                pass
However, there are sometimes other unrecognised characters in the file (as far as I have found, always at the very beginning) - these show up in Notepad++ as 'EOT', 'SUB' and 'ETX'. They seem to interfere with the processing of the file in Python - the file appears to end at those characters, even though there is clearly more data visible in Notepad++.
How can I remove all non-ASCII characters from these files prior to processing?
You are opening the file in text mode. That means that the first Ctrl-Z character is considered as an end-of-file character. Specify 'rb' instead of 'r' in open().
I don't know if this will work for sure, but you could try using the IO methods in the codecs module:
import codecs
inFile = codecs.open(<SAME ARGS AS 'OPEN'>, 'utf-8')
for line in inFile:
    do_stuff()
You can treat the inFile just like a normal FILE object.
This may or may not help you, but it probably will.
[EDIT]
Basically you'll replace h=open(os.path.join(root, rawfile),'r') with h=codecs.open(os.path.join(root, rawfile),'r', 'utf-8')
The file.read() function will read until EOF.
As you said it stops too early, you want to continue reading the file even after hitting an EOF character.
Make sure to stop when you have read the entire file. You can do this by checking the position in the file via file.tell() when hitting an EOF and stopping once you reach the file size (read the file size prior to reading).
As this is rather complex, you may want to use file.next and iterate over bytes.
To remove non-ASCII characters you can either use a whitelist of specific characters or check each byte you read against a range you define.
E.g., if the byte is between 0x30 and 0x39 (a digit) -> keep it / save it somewhere / add it to a string.
See an ASCII table.
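A minimal sketch of that byte-level filtering, assuming the file is read in binary mode as suggested above ("datafile.bin" is a placeholder name):
# Read raw bytes so EOT/SUB/ETX control characters cannot truncate the read,
# then keep only the NUL separators and printable ASCII before extracting numbers.
with open("datafile.bin", "rb") as h:
    raw = h.read()

cleaned = bytes(b for b in raw if b == 0x00 or 0x20 <= b <= 0x7E)
for raw_value in cleaned.decode("ascii").split("\x00"):
    try:
        float(raw_value)  # same numeric filter as in the question's script
    except ValueError:
        continue
    print(raw_value.strip())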
