I am simply trying to run a Python test script via Task Scheduler, which requires wrapping it in a .bat file so it can run. This is the current bat file:
"C:\Program Files (x86)\Python38-32\python.exe" "C:\Users\Declan\Documents\_Automation\test.py"
pause
However, it is giving me the following output:
C:\Users\Declan\Documents\_Automation>"ÔǬC:\Program Files (x86)\Python
38-32\python.exe" "C:\Users\Declan\Documents\_Automation\test.py"
The filename, directory name, or volume label syntax is incorrect.
C:\Users\Declan\Documents\_Automation>pause
Press any key to continue . . .
This is likely an encoding issue; the junk characters in front of the path are a classic sign of a byte order mark (BOM) being read in the wrong encoding. Check the character encoding of your batch file.
If you don't know how, simply create a new one in your favourite text editor, and instead of copying and pasting the text from the original source, retype the batch file from scratch.
However, if you have more than just a bare-bones text editor (Notepad++, UltraEdit, etc.), there will be menu options that allow you to inspect and change the encoding of an existing file. UTF-8 without a BOM, or ANSI (which varies depending on code page), are options to try.
In case you're wondering: not all text files are created equal. On disk, a text file (like any file) is just a series of bytes, and each character of the 'text' is represented by a number of bytes (the exact number depends on the encoding). Which character a byte sequence represents depends on the chosen character encoding for the file: many encodings use the same byte sequences for the most common characters, but may use different byte sequences for non-standard characters, or use specific sequences to represent characters that don't exist in other encodings. Think of special characters that are needed in some languages but not in others, for example.
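If you would rather fix the file programmatically, here is a minimal sketch that strips a UTF-8 BOM from the batch file; the path is hypothetical, and the three-byte prefix it checks for is the standard UTF-8 byte order mark:

path = r'C:\path\to\your.bat'  # hypothetical path, substitute your own

with open(path, 'rb') as f:
    raw = f.read()

if raw.startswith(b'\xef\xbb\xbf'):  # the UTF-8 BOM bytes
    with open(path, 'wb') as f:
        f.write(raw[3:])             # rewrite the file without the BOM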
Related
I am using Python version 3.7.4. I want to set the encoding to 'ANSI' both when reading a text file and when writing one.
In another case I read a file by providing 'utf-8' as the encoding (please find the code snippet below), but for 'ANSI' I cannot find any value to pass as the encoding.
code snippet :
content = open(fullfile , encoding='utf-8').readlines()
What should be done to set the encoding to 'ANSI' in Python?
There is no "ANSI" encoding. "ANSI" means "whatever the default single-byte encoding happens to be on your machine"; the term is inherently ambiguous. This means you must specify an actual encoding when reading the file.
For Windows machines in the Western Europe region, "ANSI" typically refers to Windows-1252. Other regions differ, but also your machine configuration might be different.
Python refers to Windows-1252 as cp1252. Whether that really is the encoding your file is in depends on the file itself, and can only be found out by looking at it.
Often text editors (not Notepad, real text editors) have an option to interpret a file in various encodings. Pick the one that makes the data look right (pay attention to accented characters) and then find out Python's name for it.
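For example, a minimal sketch assuming the file really is Windows-1252 (the file name is a placeholder):

with open('report.txt', encoding='cp1252') as f:  # cp1252 is an assumption
    content = f.readlines()

# On Windows only, encoding='mbcs' selects whatever the current
# ANSI code page happens to be, if you truly want the machine default.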
ANSI is not actually an encoding, you probably mean Windows-1252, which Python supports as 'cp1252'.
Try the encoding that "ANSI" usually stands for on Windows:
encoding='cp1252'
For further information, take a look here
I have a .txt file which should contain German umlauts like ä, ö, ß, ü. But these characters don't appear as such; instead, what appears is ä instead of ä, à instead of Ü, and so on. This happens because the .txt file is stored in ANSI encoding. Now, when I import this file, with the respective columns as strings, into either SAS (DATA step) or Python (with .read_csv), these strange characters appear in the .sas7bdat file and the Python DataFrame as such, instead of the proper characters like ä, ö, ü, ß.
One workaround to solve this issue is:
Open the file in standard Notepad.
Press 'Save As' and then a window appears.
Then in the drop down, change encoding to UTF-8.
Now, when you import the files, in SAS or Python, then everything is imported correctly.
But sometimes the .txt files I have are very big (in GBs), so I cannot open them and apply this hack.
I could use the .replace() function to replace these strange characters with the real ones, but there could be combinations of strange characters that I am not aware of; that's why I want to avoid that approach.
Is there any Python library which can automatically translate these strange characters into their proper characters - like ä gets translated to ä and so on?
Did you try the codecs library?
import codecs
your_file = codecs.open('your_file.extension', 'r', 'encoding_type')
If the file contains the correct code points, you just have to specify the correct encoding. Python 3 will default to UTF-8 on most sane platforms, but if you need your code to also run on Windows, you probably want to spell out the encoding.
with open(filename, 'r', encoding='utf-8') as f:
    ...  # do things with f
If the file actually contains mojibake, there is no simple way in the general case to revert every possible way to screw up text; but a common mistake is assuming text was in Latin-1 and converting it to UTF-8 when in fact the input was already UTF-8. What you can do then is say you want Latin-1, and probably make sure you save the result in the correct format as soon as you have read it.
with open(filename, 'r', encoding='latin-1') as inp, \
        open('newfile', 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)
The ftfy library claims to be able to identify and correct a number of common mojibake problems.
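A minimal sketch, assuming ftfy is installed (pip install ftfy) and that its heuristics recognise your particular flavour of mojibake:

import ftfy

# 'schÃ¶n' is UTF-8 'schön' misread as Latin-1; fix_text tries to undo that
print(ftfy.fix_text('schÃ¶n'))  # expected: 'schön'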
I built a Python steganographer that hides UTF-8 text in images, and it works fine for that. I was wondering if I could encode complete files in images. For this, the program needs to read all kinds of files. The problem is that not all files are encoded in UTF-8, and therefore you have to read them with:
file = open('somefile.docx', encoding='utf-8', errors='surrogateescape')
and if you copy the contents to a new file and read it back, the files turn out to be undecipherable. I need a way to read all kinds of files and later write them so that they still work. Do you have a way to do this in Python 3?
Thanks.
Change your view. You don't "hide UTF-8 text in images". You hide bytes in images.
These bytes could be - purely accidentally - interpretable as UTF-8-encoded text. But in reality they could be anything.
Reading a file as text with open("...", encoding="...") includes a hidden step: decoding the bytes of the file into a string. This is convenient when you want to treat the file contents as a string in your program.
Skip that hidden decoding step and read the file as bytes: open("...", "rb").
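A minimal sketch of the bytes-only round trip; the file names are placeholders:

with open('somefile.docx', 'rb') as f:
    payload = f.read()            # raw bytes, no decoding step

# ... hide `payload` in the image and extract it again later ...

with open('recovered.docx', 'wb') as f:
    f.write(payload)              # raw bytes back to disk, no encoding step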
I want to write a Python script that converts a file's encoding from cp949 to UTF-8. The file is originally encoded in cp949.
My script is as follows:
cpstr = open('terms.rtf').read()
utfstr = cpstr.decode('cp949').encode('utf-8')
tmp = open('terms_utf.rtf', 'w')
tmp.write(utfstr)
tmp.close()
But this doesn't change the encoding as I intended.
There are three kinds of RTF, and I have no idea which kind you have. You can tell by opening the file in a plain-text editor, or just using less/more/cat/type/whatever to print it out to your terminal.
First, the easy cases: plaintext RTF.
A plaintext RTF file starts off with {\rtf, and all of the text within it is (as you'd expect) plain text, although sometimes runs of text are broken up by formatting commands (which start with \) in between them. Since all of the formatting commands are pure ASCII, converting a plaintext RTF from one charset to another (as long as both are supersets of ASCII, as cp949 and UTF-8 both are) should work fine.
However, the file may also have a formatting command that specifies what character set it's written in. This command looks like \ansicpg949. When an RTF editor like Wordpad opens your file, it will interpret all your nice UTF-8 data as cp949 data and mojibake the hell out of it unless you fix it.
The simplest way to fix it is to figure out what charset your editor wants to put there for UTF-8 files. Maybe it's \ansicpg65001, maybe it's \utf8, maybe it's something completely different. So just save a simple file as a UTF-8 RTF, then look at it in plain text, and see what it has in place of \ansicpg949, and replace the string in your file with the right one. (Note that code page 65001 is not really UTF-8, but it's close, and a lot of Microsoft code assumes they're the same…)
Also, some RTF editors (like Apple's TextEdit) will escape any non-ASCII characters (so, e.g., é is stored as \'e9), in which case there's nothing to convert.
Finally, Office Open XML includes an XML spec for something that's called RTF, but isn't really the same thing. I believe many RTF editors can handle this. Fortunately, you can treat this the same way as plaintext RTF—all of the XML tags have pure-ASCII names.
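For the plaintext case, the whole conversion might look like this minimal sketch; using \ansicpg65001 as the replacement tag is an assumption you should verify against a UTF-8 file saved by your own editor, as described above:

with open('terms.rtf', encoding='cp949') as f:
    text = f.read()

# hypothetical replacement tag; check what your editor actually writes
text = text.replace(r'\ansicpg949', r'\ansicpg65001')

with open('terms_utf.rtf', 'w', encoding='utf-8') as f:
    f.write(text)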
The almost-as-easy case is compressed plaintext RTF. This is the same thing, but compressed with, I believe, zlib. Or it can actually be RTFD (which can be plaintext RTF together with images and other things in separate files, or actual plain text with formatting runs stored in a separate file) in a .zip archive. Anyway, if you have one of these, the file command on most Unix systems should be able to detect it as "compressed RTF", at which point you can figure out what the specific format is, decompress it, and then edit it as plaintext RTF (or RTFD).
Needless to say, if you don't uncompress this first, you won't see any of your familiar text in the file—and you could easily end up breaking it so it can't be decompressed, or decompresses to garbage, by changing arbitrary bytes to different bytes.
Finally, the hard case: binary RTF.
The earliest versions of these were in an undocumented format, although they've been reverse-engineered. The later versions are public specs. Wikipedia has links to the specs. If you want to parse one manually you can, but it's going to be a substantial amount of code, and you're going to have to write it yourself.
A better solution would be to use one of the many libraries on PyPI that can convert RTF (including binary RTF) to other formats, which you can then edit easily.
import codecs

cpstr = codecs.open('terms.rtf', 'r', 'cp949').read()  # decodes cp949 bytes to text
tmp = codecs.open('terms_utf.rtf', 'w', 'utf-8')       # encodes text as UTF-8 on write
tmp.write(cpstr)
tmp.close()
I have a large number of files containing data I am trying to process using a Python script.
The files are in an unknown encoding, and if I open them in Notepad++ they contain numerical data separated by a load of 'null' characters (represented as NULL in white on black background in Notepad++).
In order to handle this, I split the file contents on the null character \x00 and retrieve only numerical values using the following script:
stripped_data = []
for root, dirs, files in os.walk(PATH):
    for rawfile in files:
        (dirName, fileName) = os.path.split(rawfile)
        (fileBaseName, fileExtension) = os.path.splitext(fileName)
        h = open(os.path.join(root, rawfile), 'r')
        line = h.read()
        for raw_value in line.split('\x00'):
            try:
                test = float(raw_value)
                stripped_data.append(raw_value.strip())
            except ValueError:
                pass
However, there are sometimes other unrecognised characters in the file (as far as I have found, always at the very beginning) - these show up in Notepad++ as 'EOT', 'SUB' and 'ETX'. They seem to interfere with the processing of the file in Python - the file appears to end at those characters, even though there is clearly more data visible in Notepad++.
How can I remove all non-ASCII characters from these files prior to processing?
You are opening the file in text mode. That means the first Ctrl-Z character (SUB, 0x1A) is treated as an end-of-file marker. Specify 'rb' instead of 'r' in open().
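A minimal sketch of that fix applied to the loop above; decoding with errors='ignore' to drop the non-ASCII bytes is an assumption, and any control characters that survive simply fail the float() check:

with open(os.path.join(root, rawfile), 'rb') as h:    # binary mode: Ctrl-Z is just a byte
    text = h.read().decode('ascii', errors='ignore')  # drops bytes above 0x7F

for raw_value in text.split('\x00'):
    try:
        float(raw_value)
        stripped_data.append(raw_value.strip())
    except ValueError:
        pass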
I don't know if this will work for sure, but you could try using the IO methods in the codecs module:
import codecs
inFile = codecs.open(<SAME ARGS AS 'OPEN'>, 'utf-8')
for line in inFile:
    do_stuff()
You can treat inFile just like a normal FILE object.
This may or may not help you, but it probably will.
[EDIT]
Basically you'll replace h=open(os.path.join(root, rawfile),'r') with h=codecs.open(os.path.join(root, rawfile), 'r', 'utf-8')
The file.read() function will read until EOF.
As you said it stops too early, you want to continue reading the file even after hitting an apparent EOF.
Make sure to stop once you have read the entire file. You can do this by checking the position in the file via file.tell() when hitting EOF and stopping once you reach the file size (read the file size before you start reading).
As this is rather complex, you may want to use file.next and iterate over bytes instead.
To remove non-ASCII characters you can either use a whitelist of specific characters or check each byte you read against a range you define.
E.g., if the byte is between 0x30 and 0x39 (a digit), keep it / save it somewhere / add it to a string.
See an ASCII table.
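A minimal sketch of that whitelist idea; the exact set of allowed bytes (digits, the characters a float can contain, and the NUL separator) is an assumption:

with open(path, 'rb') as f:
    raw = f.read()

allowed = set(b'0123456789.+-eE\x00')            # whitelist of bytes to keep
cleaned = bytes(b for b in raw if b in allowed)  # drops EOT, SUB, ETX, etc.
values = [v for v in cleaned.decode('ascii').split('\x00') if v]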