Python: convert strings containing unicode code point back into normal characters - python

I'm working with the requests module to scrape text from a website and store it into a txt file using a method like below:
r = requests.get(url)
with open("file.txt","w") as filename:
    filename.write(r.text)
With this method, if "送分200000" were the only string that requests got from the URL, it would be stored in file.txt like below.
\u9001\u5206200000
When I grab the string from file.txt later on, the string doesn't convert back to "送分200000" and instead remains at "\u9001\u5206200000" when I try to print it out. For example:
with open("file.txt", "r") as filename:
    mystring = filename.readline()
print(mystring)
Output:
"\u9001\u5206200000"
Is there a way for me to convert this string and others like it back to their original strings with unicode characters?

convert this string and others like it back to their original strings with unicode characters?
Yes. Let the content of file.txt be
\u9001\u5206200000
then
with open("file.txt","rb") as f:
    content = f.read()
text = content.decode("unicode_escape")
print(text)
output
送分200000
If you want to know more, read the Text Encodings section of the built-in codecs module docs.
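As a rough sketch of what that decoding step does, the snippet below works on an in-memory string instead of a file. The helper name decode_escapes is made up for illustration, and it assumes the text is pure ASCII apart from the \uXXXX escapes, because the unicode_escape codec treats any other bytes as Latin-1.

```python
def decode_escapes(escaped):
    # Hypothetical helper: turn literal \uXXXX sequences into real characters.
    # Assumption: `escaped` contains only ASCII characters.
    return escaped.encode("ascii").decode("unicode_escape")


# What the file holds: a backslash, the letter u, four hex digits, and so on.
raw = "\\u9001\\u5206200000"
restored = decode_escapes(raw)  # 8 characters instead of 18
```

If the text actually came from a JSON response, json.loads (or r.json() in requests) performs the same unescaping and is probably the more robust choice.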

It's better to use the io module for that. Try and adapt the following code for your problem.
import io
with io.open(filename,'r',encoding='utf8') as f:
    text = f.read()
# process Unicode text
with io.open(filename,'w',encoding='utf8') as f:
    f.write(text)
Taken from https://www.tutorialspoint.com/How-to-read-and-write-unicode-UTF-8-files-in-Python

I am guessing you are using Windows. When you open a file without specifying an encoding, Python uses the platform default, which on many Windows systems is Windows-1252. Specify the encoding when you open the file:
with open("file.txt","w", encoding="UTF-8") as filename:
    filename.write(r.text)
with open("file.txt", "r", encoding="UTF-8") as filename:
    mystring = filename.readline()
print(mystring)
That works as you expect regardless of platform.
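A small self-contained check of that round trip, as a sketch: the temporary path and the variable names are placeholders chosen here so that no real file is overwritten, and the sample string is the one from the question.

```python
import os
import tempfile

text = "\u9001\u5206200000"  # the question's sample string, written with escapes

# Placeholder path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "file.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.readline()

# Reading in binary mode shows what is really on disk: UTF-8 bytes, not \u escapes.
with open(path, "rb") as f:
    raw_bytes = f.read()
```

The two CJK characters take three bytes each in UTF-8, so the file holds 12 bytes for 8 characters.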

Related

How to change byte type in Python

I have a small problem that has caused me a lot of trouble. Basically, I want to convert an image to bytes, store the string version of those bytes in a txt file, then read the file contents, transform them back into bytes, and finally into an image. I've got the first part more or less ready (it works, but it was made quickly and badly), but the conversion from string to bytes gives me problems.
When I read the image bytes, I get something like this: b'GIF89aP\x00P\x00\xe3'
But when I read it from the txt file with 'rb', or just transform the str to bytes, it gives me this: b'GIF89aP\\x00P\\x00\\xe3'
and I can't write that to an image.
I've tried to read up on this, but I couldn't find anything that would help.
The code is here. I know it's really messy, but I just need it to work:
file = open('p.gif', 'rb')
image = file.read()
str_b = str(image)
leng = len(str_b)
print(leng)
str_b = str_b[:0] + str_b[0+2:]
leng =- 1
str_b = str_b[:leng]
print(image)
#a = open('bytearray', 'w+')
#a.write(str_b)
#a.close
a = open('bytearray', 'r')
a = a.read()
temp = a.encode('utf-8')
print(temp)
#b = open('check', 'w+')
#b.write(str(string))
#print(string)
image_result = open('decoded.jpg', 'wb') # create a writable image and write the decoding result
image_result.write(temp)
Basically, my goal right now is to get bytes that look like this: b'GIF89aP\x00P\x00\xe3'
Please do not use eval as suggested above. eval has serious security vulnerabilities and will execute any Python code you pass to it. You could accidentally read a text file containing code that reformats the disk, and it would simply be executed. That is just an example, but you get my point: it's bad practice and only leads to more problems. See https://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html if you want some examples of why eval is bad.
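If the text file already holds the str() form of a bytes object, one safer way to get the bytes back is ast.literal_eval from the standard library, which accepts only Python literals and refuses to evaluate arbitrary expressions. A minimal sketch, using the sample bytes from the question:

```python
import ast

original = b"GIF89aP\x00P\x00\xe3"

# str() yields the text b'GIF89aP\x00P\x00\xe3' with literal backslashes.
as_text = str(original)

# literal_eval parses that text as a bytes literal without executing code.
restored = ast.literal_eval(as_text)

# Anything that is not a plain literal is rejected instead of being run.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
```

This only undoes the str() conversion; for storing binary data as text in the first place, a dedicated encoding is usually a better fit.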
Anyway, let's try to fix your code.
Instead of converting your byte array to a string by wrapping it in str(), I would suggest you use .decode() and .encode(). Be aware that .decode("utf-8") only succeeds when the bytes are valid UTF-8; raw image data usually is not, in which case it raises UnicodeDecodeError and a byte-safe text encoding such as Base64 is needed.
Fixed code:
with open('p.gif', 'rb') as file:
    image = file.read()  # read file bytes
str_image = image.decode("utf-8")  # bytes -> str (only works for valid UTF-8 bytes)
with open('image.txt', 'w') as file:
    file.write(str_image)  # write image as string to a text file
with open('image.txt', 'r') as file:
    str_from_file = file.read()  # read the text file and store the string
file_bytes = str_from_file.encode("utf-8")  # encode the image str back to bytes
print(type(str_from_file))  # type is str
print(type(file_bytes))  # type is bytes
I hope this fixes your issue and doesn't introduce vulnerabilities into what you're building.
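Because image data is generally not valid UTF-8, a common way to keep arbitrary bytes in a text file is Base64. The sketch below is one possible variant, not the answer's original method; the sample bytes stand in for the contents of p.gif, and the temporary path is a placeholder.

```python
import base64
import os
import tempfile

image = b"GIF89aP\x00P\x00\xe3"  # stand-in for the bytes read from p.gif

path = os.path.join(tempfile.mkdtemp(), "image.txt")  # placeholder location

# bytes -> Base64 bytes -> ASCII str, which is safe to store in a text file
with open(path, "w", encoding="ascii") as f:
    f.write(base64.b64encode(image).decode("ascii"))

# ASCII str -> original bytes again
with open(path, "r", encoding="ascii") as f:
    file_bytes = base64.b64decode(f.read())
```

The recovered file_bytes can then be written to the output image with a file opened in 'wb' mode.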

How can I read the "–" character?

I am working with PyCharm and I get my data from a separate file. This data contains this character: '–', which looks like a hyphen but apparently isn't.
This isn't an issue as long as I copy the data directly as a string, but if I read it from a file then '–' gets replaced by '–'
Here is a minimal example:
with open('data.html', 'r') as file:
    data = file.read()
print(data)
where data.html is:
example–example
prints:
example–example
I get the same encoding issue when I open data.html with Firefox.
What can I do so that this character is correctly read from the file?
Try adding encoding="utf-8" to your open() call: open('data.html', 'r', encoding="utf-8")
reference: Hyphen changing to special character –
Try writing your code like this:
with open('data.html', 'r', encoding='utf-8') as file:
    data = file.read()
print(data)
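To see where the three garbled characters come from, here is a short sketch. It assumes the default encoding on the asker's machine was Windows-1252 (cp1252), which is consistent with the output shown in the question but is not stated there.

```python
dash = "\u2013"  # EN DASH, the character stored in data.html

utf8_bytes = dash.encode("utf-8")       # the file stores it as three bytes
garbled = utf8_bytes.decode("cp1252")   # each byte read as its own character
correct = utf8_bytes.decode("utf-8")    # the three bytes read as one character
```

Each of the three UTF-8 bytes maps to a separate cp1252 character, which is why one dash turns into three symbols.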

Pasting ISO-8859-1 characters into Python IDLE - IDLE changes them

I have some lines in a text document I am trying to replace/remove. The document is in the ISO-8859-1 character encoding.
When I try to copy this line into my Python script to replace it, it won't match. If I shorten the line and remove up until the first double quotation mark " it will replace it fine.
i.e.
desc = [x.replace('Random text “^char”:', '') for x in desc]
This will not match. If I enter:
desc = [x.replace('Random text :', '') for x in desc]
It matches fine. I have checked that it isn't the ^ symbol as well.
Clearly Python IDLE is not using the same character set as my text file and is changing the symbol when I paste it into the script. So how do I get my script to look for this line if it doesn't handle the same characters?
Unfortunately, there's no sure-fire way to determine the encoding of a plain text document, although there are packages that can make very good guesses by analyzing the contents of the document. One popular 3rd-party module for encoding detection is chardet. Or you could manually use trial and error with some popular encodings and see what works.
Once you've determined the correct encoding, the replacement operation itself is simple in Python 3. The core idea is to pass the encoding to the open function, so that you can write Unicode string objects to the file, or read Unicode string objects from the file. Here's a short demo. This will work correctly if the encoding of your terminal is set to UTF-8. I've tested it on Python 3.6.0, both in the Bash shell and in idle3.6.
fname = 'test.txt'
encoding = 'cp1252'
data = 'This is some Random text “^char”: for testing\n'
print(data)
# Save the text to file
with open(fname, 'w', encoding=encoding) as f:
    f.write(data)
# Read it back in
with open(fname, 'r', encoding=encoding) as f:
    text = f.read()
print(text, text == data)
# Perform the replacement
target = 'Random text “^char”:'
out = text.replace(target, 'XXX')
print(out)
output
This is some Random text “^char”: for testing
This is some Random text “^char”: for testing
True
This is some XXX for testing
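The manual trial-and-error approach mentioned above could be sketched like this. The function name and the candidate list are illustrative choices, not part of any library, and a successful decode does not prove the guess is right, only that it is not obviously wrong.

```python
def first_working_encoding(raw, candidates):
    # Return the first candidate that decodes `raw` without error, else None.
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Curly quotes become the single bytes 0x93 and 0x94 in cp1252.
sample = "Random text \u201c^char\u201d:".encode("cp1252")

# Order matters: latin-1 accepts every byte sequence, so it goes last.
guess = first_working_encoding(sample, ["utf-8", "cp1252", "latin-1"])
```

Strict encodings such as UTF-8 are worth trying first, since they fail quickly on data that was written in a single-byte encoding.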

Reading input files with ASCII 215 as delimiter

I am trying to read from a file which has word pairs delimited by ASCII value 215. When I run the following code:
f = open('file.i', 'r')
for line in f.read().split('×'):
    print line
I get a string that looks like garbage. Here is a sample of my input:
abashedness×N
abashment×N
abash×t
abasia×N
abasic×A
abasing×t
Abas×N
abatable×A
abatage×N
abated×V
abatement×N
abater×N
Abate×N
abate×Vti
abating×V
abatis×N
abatjours×p
abatjour×N
abator×N
abattage×N
abattoir×N
abaxial×A
and here is my output after the code above is run:
z?Nlner?N?NANus?A?hion?hk?hhn?he?hanoconiosis?N
My goal is to eventually read this into either a list of tuples or something of that nature, but I'm having trouble just getting the data to print.
Thanks for all help.
Well, two things:
Your source could be Unicode! Use an escape and be safe.
Read in binary mode.
with open("file.i", "rb") as f:
    for line in f.read().split(b"\xd7"):
        print(line)
The character is delimiting the word and the part of speech, but each word is still on its own line:
with open('file.i', 'rb') as handle:
    for line in handle:
        word, pos = line.strip().split('×')
        print word, pos
Your code was splitting the whole file, so you were ending up with words like N\nabatable, N\nAbate, Vti\nabating.
To interpret bytes from a file as text, you need to know its character encoding. There Ain't No Such Thing As Plain Text. You could use codecs module to read the text:
import codecs
with codecs.open('file.i', 'r', encoding='utf-8') as file:
    for line in file:
        word, sep, suffix = line.partition(u'\u00d7')
        if sep:
            print word
Put the actual character encoding of the file instead of utf-8 placeholder e.g., cp1252.
Non-ASCII characters in string literals would require a source character encoding declaration at the top of the script, so I've used the Unicode escape: u'\u00d7'.
Thanks to both your help, I was able to hack together this bit of code that returns a list of lists holding what I'm looking for.
with open("mobyposi.i", "rb") as f:
    content = f.readlines()
f.close()
content = content[0].split()
for item in content:
    item.split("\xd7")
It was indeed in unicode! However, the implementation described above discarded the text after the unicode value and before the newline.
EDIT: Able to reduce to:
with open("mobyposi.i", "rb") as f:
    for item in f.read().split():
        item.split("\xd7")
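For readers on Python 3, the stated goal of a list of tuples might be reached along these lines. This is only a sketch: parse_pairs and SEP are names invented here, and it assumes the file is in a single-byte encoding such as Latin-1 or cp1252, where byte 215 (0xD7) is the multiplication sign.

```python
SEP = "\u00d7"  # MULTIPLICATION SIGN, byte 0xD7 in Latin-1 and cp1252


def parse_pairs(lines):
    # Split each non-empty line at the first separator into (word, part_of_speech).
    pairs = []
    for line in lines:
        word, sep, pos = line.strip().partition(SEP)
        if sep:  # skip lines that contain no separator
            pairs.append((word, pos))
    return pairs


sample = ["abashedness\u00d7N\n", "abate\u00d7Vti\n", "\n"]
pairs = parse_pairs(sample)
```

With a real file, the same function could be fed an open file object, for example one opened with open("file.i", encoding="latin-1"), since iterating over it yields lines.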

using txt file as input for python

I have a python program that requires the user to paste texts into it to process them to the various tasks. Like this:
line=(input("Paste text here: ")).lower()
The pasted text comes from a .txt file. To avoid any issues with the code (since the text contains multiple quotation marks), the user has to do the following: type 3 quotation marks, paste the text, and type 3 quotation marks again.
Can all of the above be avoided by having Python read the .txt file? If so, how?
Please let me know if the question makes sense.
In Python 2, just use raw_input to receive input as a string. No extra quotation marks on the part of the user are necessary.
line=(raw_input("Paste text here: ")).lower()
Note that input is equivalent to
eval(raw_input(prompt))
and applying eval to user input is dangerous, since it allows the user to evaluate arbitrary Python expressions. A malicious user could delete files or even run arbitrary functions, so never use input in Python 2!
In Python 3, input behaves like raw_input, so there your code would have been fine.
If instead you'd like the user to type the name of the file, then
filename = raw_input("Text filename: ")
with open(filename, 'r') as f:
    line = f.read()
Troubleshooting:
Ah, you are using Python3 I see. When you open a file in r mode, Python tries to decode the bytes in the file into a str. If no encoding is specified, it uses locale.getpreferredencoding(False) as the default encoding. Apparently that is not the right encoding for your file. If you know what encoding your file is using, it is best to supply it with the encoding parameter:
open(filename, 'r', encoding=...)
Alternatively, a hackish approach which is not nearly as satisfying is to ignore decoding errors:
open(filename, 'r', errors='ignore')
A third option would be to read the file as bytes:
open(filename, 'rb')
Of course, this has the obvious drawback that you'd then be dealing with bytes like \x9d rather than characters like ·.
Finally, if you'd like some help guessing the right encoding for your file, run
with open(filename, 'rb') as f:
    contents = f.read()
print(repr(contents))
and post the output.
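The difference between these options can be seen on a small in-memory example. The byte string below is made up for illustration; it contains one byte (0xE9) that is not valid UTF-8 in this position.

```python
raw = b"caf\xe9 menu"  # 0xE9 is e-acute in Latin-1 but invalid UTF-8 here

ignored = raw.decode("utf-8", errors="ignore")    # silently drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # marks it with U+FFFD
as_latin1 = raw.decode("latin-1")                 # right only if the file really is Latin-1
```

The errors argument of open() accepts the same handler names, so errors='ignore' loses data quietly while errors='replace' at least leaves a visible marker.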
You can use the following:
with open("file.txt") as fl:
    file_contents = [x.rstrip() for x in fl]
This will result in the variable file_contents being a list, where each element of the list is a line of your file with the newline character stripped off the end.
If you want to iterate over each line of the file, you can do this:
with open("file.txt") as fl:
    for line in fl:
        # Do something with line
        pass
The rstrip() method gets rid of whitespace at the end of a string, and it is useful for getting rid of the newline character.
