I have some lines in a text document I am trying to replace/remove. The document is in the ISO-8859-1 character encoding.
When I copy this line into my Python script to replace it, it won't match. If I shorten the line, removing the part starting at the first double quotation mark ", it replaces fine.
i.e.
desc = [x.replace('Random text “^char”:', '') for x in desc]
This will not match. If I enter:
desc = [x.replace('Random text :', '') for x in desc]
It matches fine. I have also checked that the ^ symbol isn't the problem.
Clearly Python IDLE is not using the same character set as my text file and is changing the symbol when I paste it into the script. So how do I get my script to look for this line if it doesn't handle the same characters?
Unfortunately, there's no sure-fire way to determine the encoding of a plain text document, although there are packages that can make very good guesses by analyzing its contents. One popular third-party module for encoding detection is chardet. Alternatively, you can use trial and error with some common encodings and see what works.
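For example, here's a minimal sketch using chardet (installable with pip install chardet; the file name here is assumed):

import chardet

# Read the raw bytes and let chardet guess the encoding
with open('document.txt', 'rb') as f:
    raw = f.read()
guess = chardet.detect(raw)
print(guess)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'])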
Once you've determined the correct encoding, the replacement operation itself is simple in Python 3. The core idea is to pass the encoding to the open function, so that you can write Unicode string objects to the file, or read Unicode string objects from the file. Here's a short demo. This will work correctly if the encoding of your terminal is set to UTF-8. I've tested it on Python 3.6.0, both in the Bash shell and in idle3.6.
fname = 'test.txt'
encoding = 'cp1252'
data = 'This is some Random text “^char”: for testing\n'
print(data)
# Save the text to file
with open(fname, 'w', encoding=encoding) as f:
    f.write(data)
# Read it back in
with open(fname, 'r', encoding=encoding) as f:
    text = f.read()
print(text, text == data)
# Perform the replacement
target = 'Random text “^char”:'
out = text.replace(target, 'XXX')
print(out)
Output:
This is some Random text “^char”: for testing
This is some Random text “^char”: for testing
True
This is some XXX for testing
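Applied to your situation, a sketch might look like the following (the file name is assumed; note that the curly quote “ cannot occur in ISO-8859-1 at all, so your file is more likely cp1252, which is a close superset of it):

# Read with the real encoding, do the replacement, write back
with open('document.txt', 'r', encoding='cp1252') as f:
    desc = f.readlines()
desc = [x.replace('Random text “^char”:', '') for x in desc]
with open('document.txt', 'w', encoding='cp1252') as f:
    f.writelines(desc)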
Related
I'm working with the requests module to scrape text from a website and store it into a txt file using a method like below:
r = requests.get(url)
with open("file.txt","w") as filename:
    filename.write(r.text)
With this method, if "送分200000" was the only string that requests got from the url, it would be stored in file.txt as the literal escape sequences below.
\u9001\u5206200000
When I grab the string from file.txt later on, it doesn't convert back to "送分200000" and instead remains "\u9001\u5206200000" when I try to print it out. For example:
with open("file.txt", "r") as filename:
    mystring = filename.readline()
    print(mystring)
Output:
"\u9001\u5206200000"
Is there a way for me to convert this string and others like it back to their original strings with unicode characters?
Yes, you can. Let the content of file.txt be
\u9001\u5206200000
then
with open("file.txt","rb") as f:
    content = f.read()
text = content.decode("unicode_escape")
print(text)
Output:
送分200000
If you want to know more, read the Text Encodings section of the built-in codecs module docs.
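As a quick illustration of what the codec does, here is a round trip showing how the escape sequences arise and are reversed (Python 3 assumed):

# Encoding with 'unicode_escape' produces the literal backslash
# sequences seen in file.txt; decoding them reverses the process.
escaped = '送分200000'.encode('unicode_escape')
print(escaped)                            # b'\\u9001\\u5206200000'
print(escaped.decode('unicode_escape'))   # 送分200000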
It's better to use the io module for that. Try adapting the following code to your problem.
import io
with io.open(filename,'r',encoding='utf8') as f:
    text = f.read()
# process Unicode text
with io.open(filename,'w',encoding='utf8') as f:
    f.write(text)
Taken from https://www.tutorialspoint.com/How-to-read-and-write-unicode-UTF-8-files-in-Python
I am guessing you are using Windows. When you open a file without specifying an encoding, Python uses the platform default, which on Windows is typically Windows-1252. Specify the encoding when you open the file:
with open("file.txt","w", encoding="UTF-8") as filename:
    filename.write(r.text)
with open("file.txt", "r", encoding="UTF-8") as filename:
    mystring = filename.readline()
    print(mystring)
That works as you expect regardless of platform.
I am trying to read user credentials from a text file. The password contains the 'ü' character. When I read it from the txt file and print it, I get 'l' characters instead. UTF-8 does not work for Turkish characters. How can I read it correctly?
def get_username_password():
    dosya = open("D:\\user.txt", "r", encoding="utf8", errors='ignore')
    line = dosya.readline()
    print(line)
    return line.split(",")
eyll,eyll
[screenshot of the txt file]
From the screenshot, it looks like you are using Windows. You probably saved the text file as "ANSI" which is a windows term for "whatever encoding I think is appropriate for the location setting". For Turkish, it's likely Windows-1254.
In Python, this encoding is called "cp1254", so the correct code to open the file is:
dosya = open("D:\\user.txt","r", encoding="cp1254")
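To see why errors='ignore' silently eats the character, here is a small demonstration (the byte string is my reconstruction of what the file likely contains; 'ü' is the single byte 0xFC in cp1254, which is not a valid byte sequence in UTF-8):

raw = b'eyl\xfcl,eyl\xfcl'                    # reconstructed file contents
print(raw.decode('cp1254'))                   # eylül,eylül
print(raw.decode('utf8', errors='ignore'))    # eyll,eyll -- the ü is dropped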
You can also try this (note the escaped backslash; a literal "D:\user.txt" would be parsed as an invalid \uXXXX escape):
dosya = open("D:\\user.txt", "r", encoding='utf-8')
I have a piece of Python code that reads from a txt file properly, but my colleague gave me another set of files that appear to be txt files as well. When I ran the same Python code, each line was read incorrectly.
For the new files, if the line is 240,022414114120,-500,Bauer_HS5,0
It would be read as str:2[]4[]0 []0[]2[]2[]4..... All those little rectangles between each character and the leading question mark characters are all invalid characters.
And it will further get converted to something like this:
[['\xff\xfe2\x004\x000\x00', '\x000\x002\x002\x004\x001\x004\x001\x001\x004\x001\x002\x000\x00', '\x00-\x005\x000\x000\x00',......
However, if I manually create a normal text file and copy/paste the content from the input file, the parser is able to read each line correctly. So I am thinking the input files are a different kind of file from a normal text file, even though their suffix is indeed 'txt'.
The files come from a device that regularly sends files to our server. This parser works fine for another device that does the same thing. And the files from both devices are all of type 'txt'.
Each line is read with for line in self._infile.xreadlines():
I am very confused why it would behave this way.
My Python code follows.
def __init__(self, infile=sys.stdin, outfile=sys.stdout):
    if isinstance(infile, basestring):
        infile = open(infile)
    if isinstance(outfile, basestring):
        outfile = open(outfile, "w")
    self._infile = infile
    self._outfile = outfile

def sort(self):
    lines = []
    last_second = None
    for line in self._infile.xreadlines():
        line = line.replace('\r\n', '')
        fields = line.split(',')
        if len(fields) < 2:
            continue
        second = fields[1]
        if last_second and second != last_second:
            lines = sorted(lines, self._sort_lines)
            self._outfile.write("".join([','.join(x) for x in lines]))
            #self._outfile.write("\r\n")
            lines = []
        last_second = second
        lines.append(fields)
    if lines:
        lines = sorted(lines, self._sort_lines)
        self._outfile.write("".join([','.join(x) for x in lines]))
        #self._outfile.write("\r\n")
    self._infile.close()
    self._outfile.close()
The start of the file you described as coming from your colleague is "\xff\xfe". These two characters make up a "byte order mark" that indicates that the file is encoded with the "UTF-16-LE" encoding (that is, 16-bit Unicode with the lower byte first). Your Python script is reading with an 8-bit encoding (probably whatever your system's default encoding is), so you're seeing lots of extra null characters (the high bytes of the 16-bit characters).
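You can reproduce the exact byte pattern you pasted (Python 3 shown for brevity):

# UTF-16-LE stores the low byte of each character first, so ASCII text
# gains an interleaved null byte per character; the file also starts
# with the \xff\xfe byte order mark.
data = '240,'.encode('utf-16-le')
print(b'\xff\xfe' + data)   # b'\xff\xfe2\x004\x000\x00,\x00'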
I can't speak to how the file got a different encoding. Windows text editors (like notepad.exe) are somewhat notorious for silently reencoding files in unhelpful ways if you're not careful with them, so it may be that your colleague previewed the file in an editor and then saved it before forwarding it on to you.
Anyway, the simplest fix is probably to reencode the file. There are various utilities to do this on various OSs, or you could write your own easily enough. Here's a quick and dirty function to reencode a file in Python (which will hopefully raise an exception if the encoding parameters are wrong, but perhaps not always):
def reencode_file(filename, from_encoding="UTF-16-LE", to_encoding="ascii"):
    with open(filename, "rb") as f:
        in_bytes = f.read()                    # read bytes
    text = in_bytes.decode(from_encoding)      # decode to unicode
    out_bytes = text.encode(to_encoding)       # reencode to new encoding
    with open(filename, "wb") as f:
        f.write(out_bytes)                     # write back to the file
If the file you get is going to always be encoded in UTF-16, you could change your regular script to decode it automatically. In Python 2.7, I'd suggest using the io module's open function for this (it is the same code that the regular open uses in Python 3). Note however that the file object returned won't support the xreadlines method which has been deprecated for a long time (just iterate over the file directly instead).
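A minimal sketch of that approach on Python 2.7 (the file name is assumed; the 'utf-16' codec consumes the BOM for you):

import io

# io.open decodes to unicode on the fly, like the built-in open in Python 3
with io.open('data.txt', 'r', encoding='utf-16') as f:
    for line in f:   # iterate directly; this file object has no xreadlines
        fields = line.rstrip(u'\r\n').split(u',')
        print(fields)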
I have a python program that requires the user to paste texts into it to process them to the various tasks. Like this:
line=(input("Paste text here: ")).lower()
The pasted text comes from a .txt file. To avoid any issues with the code (since the text contains multiple quotation marks), the user has to do the following: type 3 quotation marks, paste the text, and type 3 quotation marks again.
Can all of the above be avoided by having Python read the .txt file directly? And if so, how?
Please let me know if the question makes sense.
In Python2, just use raw_input to receive input as a string. No extra quotation marks on the part of the user are necessary.
line=(raw_input("Paste text here: ")).lower()
Note that input is equivalent to
eval(raw_input(prompt))
and applying eval to user input is dangerous, since it allows the user to evaluate arbitrary Python expressions. A malicious user could delete files or even run arbitrary functions so never use input in Python2!
In Python3, input behaves like raw_input, so there your code would have been fine.
If instead you'd like the user to type the name of the file, then
filename = raw_input("Text filename: ")
with open(filename, 'r') as f:
    line = f.read()
Troubleshooting:
Ah, you are using Python3 I see. When you open a file in r mode, Python tries to decode the bytes in the file into a str. If no encoding is specified, it uses locale.getpreferredencoding(False) as the default encoding. Apparently that is not the right encoding for your file. If you know what encoding your file is using, it is best to supply it with the encoding parameter:
open(filename, 'r', encoding=...)
Alternatively, a hackish approach which is not nearly as satisfying is to ignore decoding errors:
open(filename, 'r', errors='ignore')
A third option would be to read the file as bytes:
open(filename, 'rb')
Of course, this has the obvious drawback that you'd then be dealing with bytes like \x9d rather than characters like ·.
Finally, if you'd like some help guessing the right encoding for your file, run
with open(filename, 'rb') as f:
    contents = f.read()
print(repr(contents))
and post the output.
You can use the following:
with open("file.txt") as fl:
    file_contents = [x.rstrip() for x in fl]
This will result in the variable file_contents being a list, where each element of the list is a line of your file with the newline character stripped off the end.
If you want to iterate over each line of the file, you can do this:
with open("file.txt") as fl:
    for line in fl:
        # Do something
The rstrip() method gets rid of whitespace at the end of a string, and it is useful for getting rid of the newline character.
Whenever I try to open a .csv file with the python command
fread = open('input.csv', 'r')
it always opens the file with spaces between every single character. I'm guessing it's something wrong with the text file because I can open other text files with the same command and they are loaded correctly. Does anyone know why a text file would load like this in python?
Thanks.
Update
OK, I got it with the help of Jarret Hardie's post. This is the code that I used to convert the file to ASCII:
fread = open('input.csv', 'rb').read()
mytext = fread.decode('utf-16')
mytext = mytext.encode('ascii', 'ignore')
fwrite = open('input-ascii.csv', 'wb')
fwrite.write(mytext)
Thanks!
The post by recursive is probably right... the contents of the file are likely encoded with a multi-byte charset. If this is, in fact, the case, you can likely read the file in Python itself without having to convert it first outside of Python.
Try something like:
fread = open('input.csv', 'rb').read()
mytext = fread.decode('utf-16')
The 'b' flag ensures the file is read as binary data. You'll need to know (or guess) the original encoding... in this example, I've used utf-16, but YMMV. This will convert the file to unicode. If you truly have a file with multi-byte chars, I don't recommend converting it to ascii as you may end up losing a lot of the characters in the process.
EDIT: Thanks for uploading the file. There are two bytes at the front of the file which indicate that it does, indeed, use a wide charset. If you're curious, open the file in a hex editor as some have suggested... you'll see something in the text version like 'I.D.|.' (etc). The dot is the extra byte for each char.
The code snippet above seems to work on my machine with that file.
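If you want your code to detect this situation rather than assume it, here is a small sniffing sketch (file name assumed):

# Peek at the first two bytes and look for a UTF-16 byte order mark
with open('input.csv', 'rb') as f:
    head = f.read(2)
if head == b'\xff\xfe':
    print('UTF-16-LE BOM found')
elif head == b'\xfe\xff':
    print('UTF-16-BE BOM found')
else:
    print('no UTF-16 BOM')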
The file is encoded in some Unicode encoding, but you are reading it as ASCII. Try to convert the file to ASCII before using it in Python.
Isn't CSV just a plain text file with comma-separated values?
Try opening it with a text editor to see if the file is correctly formed.
To read an encoded file, you can simply replace open with codecs.open (remember to import the codecs module first).
import codecs
fread = codecs.open('input.csv', 'r', 'utf-16')
It never occurred to me, but as truppo said, it must be something wrong with the file.
Try opening the file in Excel/BrOffice Calc and saving it as CSV again.
If the problem persists, try a subset of the data: first 10 / last 10 / middle 10 lines of the file.
Open the file in binary mode, 'rb'. Check it in a hex editor for null padding ('00' bytes). Open the file in something like Scintilla Text Editor to check which characters are present in the file.
Here's the quick and easy way, especially if Python won't parse the input correctly:
sed 's/ \(.\)/\1/g'
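If you'd rather stay in Python, a rough equivalent of that one-liner (assuming the padding really does show up as spaces) is:

import re
import sys

# Drop the padding character before each real character, as the sed does
for line in sys.stdin:
    sys.stdout.write(re.sub(r' (.)', r'\1', line))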