How can I read the "–" character? - python

I am working with Pycharm and I get my data from a separate file. This data contains this character: '–', that looks like a hyphen but apparently isn't.
This isn't an issue as long as I copy the data directly as a string, but if I read it from a file then '–' gets replaced by 'â€“'.
Here is a minimal example:
with open('data.html', 'r') as file:
    data = file.read()
print(data)
where data.html is:
example–example
prints:
exampleâ€“example
I get the same encoding issue when I open data.html with Firefox.
What can I do so that this character is correctly read from the file?

Try adding encoding="utf-8" to your open() call:
open('data.html', 'r', encoding="utf-8")
reference: Hyphen changing to special character –

Try writing the code like this:
with open('data.html', 'r', encoding='utf-8') as file:
    data = file.read()
print(data)
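To see where the three stray characters come from, here is a small sketch (not part of the original answers). It assumes the file was saved as UTF-8 and that the default encoding on the asker's machine is cp1252, which is common on Windows but is only a guess here:

```python
# The character in the file is an en dash, U+2013.
dash = "\u2013"

# Saved as UTF-8, it occupies three bytes.
raw = dash.encode("utf-8")
print(raw)  # b'\xe2\x80\x93'

# Decoding those bytes as cp1252 (the assumed default) yields three characters.
garbled = raw.decode("cp1252")
print(len(garbled))  # 3

# Decoding with the encoding the file was written in restores the dash.
restored = raw.decode("utf-8")
print(restored == dash)  # True
```

This is why passing encoding="utf-8" to open() fixes the output: the bytes in the file were fine all along, they were only being decoded with the wrong codec.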

Related

Python: convert strings containing unicode code point back into normal characters

I'm working with the requests module to scrape text from a website and store it into a txt file using a method like below:
r = requests.get(url)
with open("file.txt","w") as filename:
    filename.write(r.text)
With this method, say if "送分200000" was the only string that requests got from url, it would've been decoded and stored in file.txt like below.
\u9001\u5206200000
When I grab the string from file.txt later on, the string doesn't convert back to "送分200000" and instead remains at "\u9001\u5206200000" when I try to print it out. For example:
with open("file.txt", "r") as filename:
    mystring = filename.readline()
print(mystring)
Output:
"\u9001\u5206200000"
Is there a way for me to convert this string and others like it back to their original strings with unicode characters?
"convert this string and others like it back to their original strings with unicode characters?"
Yes. Suppose the content of file.txt is
\u9001\u5206200000
then
with open("file.txt","rb") as f:
    content = f.read()
text = content.decode("unicode_escape")
print(text)
output
送分200000
If you want to know more, read the Text Encodings section of the built-in codecs module documentation.
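If the escaped text has already been read in as a str rather than bytes, the same codec can still be applied by going through bytes first. A minimal sketch, assuming the text is pure ASCII apart from the \u escapes (unicode_escape treats any other bytes as Latin-1, so it is not safe for arbitrary input):

```python
# Text as it would come back from file.txt: literal backslash-u sequences.
escaped = "\\u9001\\u5206200000"

# unicode_escape decodes bytes, so encode the ASCII text first.
text = escaped.encode("ascii").decode("unicode_escape")

print(text == "\u9001\u5206200000")  # True
```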
You can also use the io module for that (in Python 3, io.open is the same function as the built-in open; the explicit module is mainly useful in Python 2). Try to adapt the following code to your problem.
import io
with io.open(filename,'r',encoding='utf8') as f:
    text = f.read()
# process Unicode text
with io.open(filename,'w',encoding='utf8') as f:
    f.write(text)
Taken from https://www.tutorialspoint.com/How-to-read-and-write-unicode-UTF-8-files-in-Python
I am guessing you are using Windows. When you open a file without specifying an encoding, Python uses the platform's default, which on Windows is typically Windows-1252 (cp1252). Specify the encoding when you open the file:
with open("file.txt","w", encoding="UTF-8") as filename:
    filename.write(r.text)
with open("file.txt", "r", encoding="UTF-8") as filename:
    mystring = filename.readline()
print(mystring)
That works as you expect regardless of platform.
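A self-contained way to check this round trip without touching real files is sketched below. The temporary directory and the hard-coded string are stand-ins for the asker's file.txt and r.text:

```python
import os
import tempfile

text = "\u9001\u5206200000"  # the string from the question

# A temporary directory stands in for wherever file.txt really lives.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")

    # Write and read with the same explicit encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        mystring = f.readline()

    # On disk the two CJK characters take three bytes each, the digits one each.
    size = os.path.getsize(path)

print(mystring == text, size)  # True 12
```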

What is the simplest way to fix an existing csv unicode utf-8 without BOM file not displaying correctly in excel?

I need to convert a UTF-8 CSV file to an Excel file, but Excel does not read it properly because there is no byte order mark (BOM) at the beginning of the file.
I found this approach:
https://stackoverflow.com/a/38025106/6102332
import csv

with open('test.csv', 'w', newline='', encoding='utf-8-sig') as f:
    w = csv.writer(f)
    # Write Unicode strings.
    w.writerow([u'English', u'Chinese'])
    w.writerow([u'American', u'美国人'])
    w.writerow([u'Chinese', u'中国人'])
But that seems to work only for brand-new files, not for my file, which already contains data.
Is there an easy way to do this?
Is there any way other than this one? https://stackoverflow.com/a/6488070/6102332
Save the exported file as a csv
Open Excel
Import the data using Data-->Import External Data --> Import Data
Select the file type of "csv" and browse to your file
In the import wizard change the File_Origin to "65001 UTF" (or choose correct language character identifier)
Change the Delimiter to comma
Select where to import to and Finish
Read the file in and write it back out with the encoding desired:
with open('input.csv','r',encoding='utf-8-sig') as fin:
    with open('output.csv','w',encoding='utf-8-sig') as fout:
        fout.write(fin.read())
The utf-8-sig codec removes the BOM on read if one is present and adds a BOM on write, so the above can safely be run on files that originally had a BOM as well as on files that did not.
You can convert in place by doing:
file = 'test.csv'
with open(file,'r',encoding='utf-8-sig') as f:
    data = f.read()
with open(file,'w',encoding='utf-8-sig') as f:
    f.write(data)
Note also that utf16 works as well. Some older Excels don't handle UTF-8 correctly.
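The claim that the utf-8-sig rewrite is safe to repeat can be checked with a short sketch. The temporary file and its one-line content are made up for the test; only the read/write pattern comes from the answer:

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")

    # Start with a UTF-8 file that has no BOM.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("English,Chinese\r\n")

    # Rewrite it in place through utf-8-sig, twice, to show no BOM is stacked.
    for _ in range(2):
        with open(path, "r", encoding="utf-8-sig", newline="") as f:
            data = f.read()
        with open(path, "w", encoding="utf-8-sig", newline="") as f:
            f.write(data)

    # Inspect the raw bytes that ended up on disk.
    with open(path, "rb") as f:
        raw = f.read()

print(raw.startswith(codecs.BOM_UTF8), raw.count(codecs.BOM_UTF8))  # True 1
```

Note that utf-8-sig only strips a BOM at the very start of the file, so a BOM that has ended up in the middle of the data is left untouched by this approach.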
Thank You!
I have found a way to automatically handle the missing BOM utf-8 signature.
Besides the missing BOM signature, there is another problem: duplicate BOM signatures can end up mixed into the file data. Excel does not show them, and they cause errors when the data is compared or used in calculations. For example (<BOM> marks the invisible character):
data -> Excel
Chinese<BOM> -> Chinese
12 -> 12
If you compare them, Chinese<BOM> is obviously not equal to Chinese.
Python code to solve the problem:
import codecs
bom_utf8 = codecs.BOM_UTF8
def fix_duplicate_bom_utf8(file, bom=bom_utf8):
    # Read the raw bytes so that every BOM, wherever it sits, can be found.
    with open(file, 'rb') as f:
        data_f = f.read()
    # Drop all BOMs, then put exactly one at the start of the file.
    data_finish = bom + data_f.replace(bom, b'')
    with open(file, 'wb') as f:
        f.write(data_finish)
# Usage:
file_csv = r"D:\data\d20200114.csv" # American, 美国人
fix_duplicate_bom_utf8(file_csv)
# file_csv -> American, 美国人
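The comparison problem described above can be reproduced in memory. This is only an illustration with a made-up cell value; "\ufeff" is how the UTF-8 BOM bytes appear once a file has been decoded as plain UTF-8:

```python
BOM = "\ufeff"  # the decoded form of the UTF-8 byte order mark

# A cell that picked up a stray BOM looks identical on screen but is not equal.
cell = "Chinese" + BOM
print(cell == "Chinese")  # False

# Removing the character restores equality.
cleaned = cell.replace(BOM, "")
print(cleaned == "Chinese")  # True
```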

Printing single quote become weird character in .txt-file

I used Selenium to scrape some sentences and then wrote the result to a .txt file, but the file does not show ' and displays a weird character instead:
Original sentence:
I don't think so.
in .txt file:
I don? think so.
I have specified the .txt encoding to "utf-8" already, what should I do?
You need to open the file for writing with the encoding set to "utf-8":
with open("file.txt", "w", encoding="utf-8") as file:
    file.write("your_text")
Hope it helps you!
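One common way for a ? to appear is sketched below. It assumes the scraped page uses the typographic apostrophe U+2019 rather than the plain ASCII one, which the question does not confirm:

```python
sentence = "I don\u2019t think so."  # U+2019 is the typographic apostrophe

# An encoding without U+2019 cannot store it; with errors="replace" it becomes "?".
lossy = sentence.encode("ascii", errors="replace")
print(lossy)  # b'I don?t think so.'

# UTF-8 can represent it, so the round trip is lossless.
round_trip = sentence.encode("utf-8").decode("utf-8")
print(round_trip == sentence)  # True
```

If the file is written as UTF-8 and still looks wrong, the program used to view it may be decoding it with a different encoding.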

Pasting ISO-8859-1 characters into Python IDLE - IDLE changes them

I have some lines in a text document I am trying to replace/remove. The document is in the ISO-8859-1 character encoding.
When I try to copy this line into my Python script to replace it, it won't match. If I shorten the line and remove up until the first double quotation mark " it will replace it fine.
i.e.
desc = [x.replace('Random text “^char”:', '') for x in desc]
This will not match. If I enter:
desc = [x.replace('Random text :', '') for x in desc]
It matches fine. I have checked that it isn't the ^ symbol as well.
Clearly Python IDLE is not using the same character set as my text file and is changing the symbol when I paste it into the script. So how do I get my script to look for this line if it doesn't handle the same characters?
Unfortunately, there's no sure-fire way to determine the encoding of a plain text document, although there are packages that can make very good guesses by analyzing the contents of the document. One popular 3rd-party module for encoding detection is chardet. Or you could manually use trial and error with some popular encodings and see what works.
Once you've determined the correct encoding, the replacement operation itself is simple in Python 3. The core idea is to pass the encoding to the open function, so that you can write Unicode string objects to the file, or read Unicode string objects from the file. Here's a short demo. This will work correctly if the encoding of your terminal is set to UTF-8. I've tested it on Python 3.6.0, both in the Bash shell and in idle3.6.
fname = 'test.txt'
encoding = 'cp1252'
data = 'This is some Random text “^char”: for testing\n'
print(data)
# Save the text to file
with open(fname, 'w', encoding=encoding) as f:
    f.write(data)
# Read it back in
with open(fname, 'r', encoding=encoding) as f:
    text = f.read()
print(text, text == data)
# Perform the replacement
target = 'Random text “^char”:'
out = text.replace(target, 'XXX')
print(out)
output
This is some Random text “^char”: for testing
This is some Random text “^char”: for testing
True
This is some XXX for testing
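The trial-and-error approach mentioned in this answer could look roughly like the sketch below. The helper name decode_with_first_match and the candidate list are made up for illustration. A clean decode does not prove the guess is right, and an encoding such as latin-1 never fails, so it would only make sense as the last candidate:

```python
def decode_with_first_match(raw, candidates=("utf-8", "cp1252")):
    """Return (encoding, text) for the first candidate that decodes raw cleanly."""
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# Bytes as they would sit in a cp1252 file containing curly quotes.
raw = "Random text \u201c^char\u201d:".encode("cp1252")
encoding, text = decode_with_first_match(raw)
print(encoding)  # cp1252
```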

How to get python to open an outside file?

I am writing a program for class that opens a file, counts the words, returns the number of words, and closes. I understand how to do everything except get the file to open and display the text. This is what I have so far:
fname = open("C:\Python32\getty.txt")
file = open(fname, 'r')
data = file.read()
print(data)
The error I'm getting is:
TypeError: invalid file: <_io.TextIOWrapper name='C:\\Python32\\getty.txt' mode='r' encoding='cp1252'>
The file is saved in the correct place and I have checked spelling, etc. I am using pycharm to work on this and the file that I am trying to open is in notepad.
You're using open() twice, so you've actually already opened the file, and then you attempt to open the already opened file object... change your code to:
fname = "C:\\Python32\\getty.txt"
infile = open(fname, 'r')
data = infile.read()
print(data)
The TypeError is saying that it cannot open type _io.TextIOWrapper which is what open() returns when opening a file.
Edit: You should really be handling files like so:
with open(r"C:\Python32\getty.txt", 'r') as infile:
    data = infile.read()
print(data)
because when the with statement block is finished, it will handle file closing for you, which is very nice.
The r before the string makes it a raw string literal, so Python does not interpret the backslashes as escape sequences and leaves the path exactly as you typed it.
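The double-open mistake and the with-based fix can be tried out safely with a temporary file. In this sketch the file name and its contents are stand-ins for the asker's getty.txt, and the word count is added only because the assignment asks for one:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "getty.txt")  # stand-in for the asker's file
    with open(path, "w") as f:
        f.write("Four score and seven years ago")

    with open(path, "r") as infile:
        # open() wants a path, not an already opened file object.
        try:
            open(infile, "r")
            failed = False
        except TypeError:
            failed = True

        # Reading from the file object itself works; split() gives the words.
        word_count = len(infile.read().split())

print(failed, word_count)  # True 6
```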
The problem is in the first line. It should be a simple assignment without the open, i.e. fname = "c:\Python32\getty.txt". Also, you'd be better off escaping the backslashes (e.g. '\\') or putting an 'r' in front of the string literal (this isn't a problem in your specific program, but may become one if the file name contains a special character). Overall the program should be:
fname = r"c:\Python32\getty.txt"
file = open(fname,'r')
data = file.read()
print (data)
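To illustrate when an unescaped backslash does become a problem, here is a sketch with a hypothetical path whose folder and file names happen to start with n and t:

```python
# Hypothetical path, not the one from the question.
plain = "C:\new\table.txt"      # \n and \t are read as newline and tab
raw = r"C:\new\table.txt"       # raw string: backslashes kept literally
doubled = "C:\\new\\table.txt"  # escaped backslashes: same result as raw

print("\n" in plain, "\t" in plain)  # True True
print(raw == doubled)                # True
print(len(plain), len(raw))          # 14 16
```

The path in the question is unaffected only because \P and \g are not recognized escape sequences.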
Call read() on the file object itself rather than on its name attribute; file.name is just the path string and has no read() method:
data = file.read()
Also be careful with backslashes in the path: in a normal string literal a backslash can start an escape sequence, so it is safer to use forward slashes /. E.g.
file_ = open("C:/Python32/getty.txt", "r")
read = file_.read()
file_.close()
print(read)
After this, the whole content of the file is in read.
The second argument to open() is the file mode ('r', 'U', 'w', 'a', possibly with 'b' or '+' added).
Edit:
If you don't want to change the slashes, simply add an r before the string: r"path"
fname = r"C:\Python32\getty.txt"
file_ = open(fname, 'r')
data = file_.read()
print(data)
