Python: Issue printing to file special characters (Spanish alphabet) - python

I'm making an algorithm to classify words by the number of times they appear in a text given by a file.
Here is my method:
def printToFile(self, fileName):
    file_to_print = open(fileName, 'w')
    file_to_print.write(str(self))
    file_to_print.close()
and here is the __str__ method:
def __str__(self):
    cadena = ""
    self.processedWords = collections.OrderedDict(sorted(self.processedWords.items()))
    for key in self.processedWords:
        cadena += str(key) + ": " + str(self.processedWords[key]) + "\n"
    return cadena.decode('string_escape')
When I print the data to the console there are no issues; nevertheless, when I write it to a file, random characters appear.
This is what the output to the file should be:
This is the output given:

This looks like an encoding issue; try opening the file like this:
    open("file name", "w", encoding="utf8")
UTF-8 is the most popular encoding, but it might not be the actual encoding here; you may have to try others such as UTF-16.
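To make the suggested fix concrete, here is a minimal sketch (Python 3; the filename and word counts are made up for illustration) of writing words with Spanish characters using an explicit encoding:

```python
# A minimal sketch of the fix: pass an explicit encoding when opening
# the output file, so characters like ñ survive the round trip.
# (The filename "salida.txt" and the sample counts are illustrative.)
palabras = {"niño": 3, "mañana": 2}

with open("salida.txt", "w", encoding="utf-8") as f:
    for palabra, cuenta in sorted(palabras.items()):
        f.write(palabra + ": " + str(cuenta) + "\n")

# Read it back with the same encoding to verify the characters survived.
with open("salida.txt", "r", encoding="utf-8") as f:
    print(f.read())
```

Reading the file back with the same explicit encoding is the easiest way to confirm the characters were written correctly.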

Related

TypeError: the JSON object must be str, bytes or bytearray, not NoneType (while Using googletrans API)

I'm trying to translate a yml file using the googletrans API.
This is my code:
#Import
from googletrans import Translator
import re

# API
translator = Translator()

# Counters
counter_DoNotTranslate = 0
counter_Translate = 0

#Translator
with open("ValuesfileNotTranslatedTest.yml") as a_file:  #Values file not translated
    for object in a_file:
        stripped_object = object.rstrip()
        found = False
        file = open("ValuesfileTranslated.yml", "a")  #Translated file
        if "# Do not translate" in stripped_object:  #Don't translate lines with "#"
            counter_DoNotTranslate += 1
            file.writelines(stripped_object + "\n")
        else:  #Translates English to Dutch and appends
            counter_Translate += 1
            results = translator.translate(stripped_object, src='en', dest='nl')
            translatedText = results.text
            file.writelines(re.split('|=', translatedText, maxsplit=1)[-1].strip() + "\n")

#Print
print("# Do not translate found: " + str(counter_DoNotTranslate))
print("Words translated: " + str(counter_Translate))
This is the yml file I want to translate:
'Enter a section title'
'Enter a description of the section. This will also be shown on the course details page'
'Title'
'Description'
'Start date'
'End date'
Published
Section is optional
Close discussions?
'Enter a title'
But when I try to run the code I get the following error:
File "/Users/AndreB/Library/Python/3.9/lib/python/site-packages/googletrans/client.py", line 219, in translate
    parsed = json.loads(data[0][2])
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
I think the problem is that there are whitespace-only lines in the yml file, so I tried adding
    if stripped_object is None:  #This would skip the lines in the yaml file where there are whitespaces
        file.writelines(stripped_object + "\n")
to the code. But I still get the same error message.
Does anyone have an idea how I can fix this?
There are quite a lot of problems with the code you present, none of which is causing the problem. The problem is, indeed, likely caused by blank lines in the yml file, but your test is incorrect:
"" is None          # False
" " is None         # also False
not ""              # True
not " "             # False
not " ".strip()     # True
So the correct way to test for a line consisting of zero or more whitespace chars is to take the truthiness of line.strip(). In this case your gate would be:
if not line.strip():
    out.write("\n")
Which brings me to the other problems with this code:
your variable names shadow built-in names (object, file)
you open the output file once for every line in the input file (and never close it), despite correctly using a context manager for the input file
your variable names mix conventions (snake_case and camelCase)
Here's a draft of what a function might look like which avoids these problems:
from pathlib import Path
from googletrans import Translator

translator = Translator()

def translate_file(infn: str | Path, outfn: str | Path, src="en", dest="nl") -> tuple[int, int]:
    translated = 0
    skipped = 0
    with Path(infn).open() as inf, Path(outfn).open("w") as outf:
        for line in inf:
            if not line.strip():
                outf.write("\n")
            elif "# Do not translate" in line:
                outf.write(line)
                skipped += 1
            else:
                outf.write(translator.translate(line, src=src, dest=dest).text + "\n")
                translated += 1
    return translated, skipped
There are other things you will doubtless want to do, and I don't fully understand your code for handling the response from translator.translate() (doubtless because I have never used the library).
Note that if you actually want to translate real yml, you would be much better off first parsing it, then translating the parts of the tree that need translating, and then dumping it back to disk. Working line by line is going to break sooner or later on valid syntax that doesn't work linewise.
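As a sketch of that tree-based approach: assume the file has already been parsed into Python objects (for example with PyYAML's yaml.safe_load, not shown here to keep this self-contained); a recursive walk can then translate only the string leaves. Here translate is a stand-in for whatever real translation call you use:

```python
def translate_tree(node, translate):
    # Recursively rebuild the parsed tree, passing every string leaf
    # through the supplied translate callable and leaving everything
    # else (numbers, booleans, None) untouched.
    if isinstance(node, str):
        return translate(node)
    if isinstance(node, list):
        return [translate_tree(item, translate) for item in node]
    if isinstance(node, dict):
        return {key: translate_tree(value, translate) for key, value in node.items()}
    return node

# Demo with str.upper standing in for a real translation call:
print(translate_tree({"title": "Enter a title", "published": True}, str.upper))
```

Dumping the translated tree back out (e.g. with yaml.safe_dump) then preserves the file's structure instead of relying on line-by-line guesses.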

How to translate encoding by ansi into unicode

When I use the CountVectorizer in sklearn, it needs the file encoded in Unicode, but my data file is encoded in ANSI.
I tried changing the encoding to Unicode using Notepad++, but then when I use readlines it cannot read all the lines, only the last one. After that, I tried to read the lines from the data file and write them into a new file as Unicode, but I failed.
def merge_file():
    root_dir = "d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
    resname = 'resule_final.txt'
    if os.path.exists(resname):
        os.remove(resname)
    result = codecs.open(resname, 'w', 'utf-8')
    num = 1
    for back_name in os.listdir('d:\\workspace\\minibatchk-means\\data\\20_newsgroups'):
        current_dir = root_dir + str(back_name)
        for filename in os.listdir(current_dir):
            print num, ":", str(filename)
            num = num + 1
            path = current_dir + "\\" + str(filename)
            source = open(path, 'r')
            line = source.readline()
            line = line.strip('\n')
            line = line.strip('\r')
            while line != "":
                line = unicode(line, "gbk")
                line = line.replace('\n', ' ')
                line = line.replace('\r', ' ')
                result.write(line + ' ')
                line = source.readline()
            else:
                print 'End file :' + str(filename)
                result.write('\n')
                source.close()
    print 'End All.'
    result.close()
The error message is: UnicodeDecodeError: 'gbk' codec can't decode bytes in position 0-1: illegal multibyte sequence
Oh, I found the way.
First, use chardet to detect the string's encoding.
Second, use codecs to read from or write to the file in the specific encoding.
Here is the code.
import chardet
import codecs
import os

root_dir = "d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
num = 1
failed = []
for back_name in os.listdir("d:\\workspace\\minibatchk-means\\data\\20_newsgroups"):
    current_dir = root_dir + str(back_name)
    for filename in os.listdir(current_dir):
        print num, ":", str(filename)
        num = num + 1
        path = current_dir + "\\" + str(filename)
        content = open(path, 'r').read()
        source_encoding = chardet.detect(content)['encoding']
        if source_encoding == None:
            print '??', filename
            failed.append(filename)
        elif source_encoding != 'utf-8':
            content = content.decode(source_encoding, 'ignore')
            codecs.open(path, 'w', encoding='utf-8').write(content)
print failed
Thanks for all your help.

I need to check if my file starts with a BOM and is encoded in utf-16-le

Here is what I tried:
I found this doc : How to guess the encoding of a document?
So I tried a piece of the code:
def check_encoding(self, filename):
    data = open(self.input_path_value.get() + "/" + filename, 'r')
    if data.startswith(codecs.BOM_UTF16_LE):
        return True
    else:
        return False
But it doesn't understand the startswith() function. I just need to check the document's first characters (where the BOM is located). And my files can be as large as 9 GB, so I can't load the whole text into RAM.
I also tried to do something like:
try:
    data = open(self.input_path_value.get() + "/" + filename, 'r', encoding='utf-16-le')
    return True
except:
    return False
But it doesn't really check if there's a BOM and sometimes it works but it's not really utf16 encoded.
Any ideas how to check this simply?
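A minimal sketch of such a check: open the file in binary mode and read only the first two bytes, so the file size doesn't matter. (starts_with_utf16_le_bom is a made-up helper name; the BOM constant comes from the standard codecs module.)

```python
import codecs

def starts_with_utf16_le_bom(path):
    # Open in binary mode and read only the first two bytes,
    # so even a 9 GB file costs almost nothing to check.
    with open(path, "rb") as f:
        return f.read(2) == codecs.BOM_UTF16_LE

# Demo: a file that starts with the UTF-16-LE BOM, and one that doesn't.
with open("with_bom.bin", "wb") as f:
    f.write(codecs.BOM_UTF16_LE + "hej".encode("utf-16-le"))
with open("no_bom.bin", "wb") as f:
    f.write(b"plain ascii")

print(starts_with_utf16_le_bom("with_bom.bin"))  # True
print(starts_with_utf16_le_bom("no_bom.bin"))    # False
```

Note that a matching BOM only tells you the file starts with those two bytes; it does not guarantee the rest of the file is valid UTF-16-LE.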

Write user input to file python

I can't figure out how to write the user input to an existing file. The file already contains a series of letters and is called corpus.txt. I want to take the user input, add it to the file, save, and close the loop.
This is the code I have :
if user_input == "q":
    def write_corpus_to_file(mycorpus, myfile):
        fd = open(myfile, "w")
        input = raw_input("user input")
        fd.write(input)
    print "Writing corpus to file: ", myfile
    print "Goodbye"
    break
Any suggestions?
The user info code is :
def segment_sequence(corpus, letter1, letter2, letter3):
    one_to_two = corpus.count(letter1 + letter2) / corpus.count(letter1)
    two_to_three = corpus.count(letter2 + letter3) / corpus.count(letter2)
    print "Here is the proposed word boundary given the training corpus:"
    if one_to_two < two_to_three:
        print "The proposed end of one word: %r " % target[0]
        print "The proposed beginning of the new word: %r" % (target[1] + target[2])
    else:
        print "The proposed end of one word: %r " % (target[0] + target[1])
        print "The proposed beginning of the new word: %r" % target[2]
I also tried this :
f = open(myfile, 'w')
mycorpus = ''.join(corpus)
f.write(mycorpus)
f.close()
Because I want the user input to be added to the file without deleting what is already there, but nothing works.
Please help!
Open the file in append mode by using "a" as the mode.
For example:
f = open("path", "a")
Then write to the file and the text should be appended to the end of the file.
This code example works for me:
#!/usr/bin/env python

def write_corpus_to_file(mycorpus, myfile):
    with open(myfile, "a") as dstFile:
        dstFile.write(mycorpus)

write_corpus_to_file("test", "./test.tmp")
The "with open ... as" construct is a convenient way in Python to open a file, do something with it within the block defined by the "with", and let Python handle the rest once you exit the block (for example, closing the file).
If you want to write the input from the user, you can replace mycorpus with your input (I am not too sure what you want to do from your code snippets).
Note that no newline is added by the write method. You probably want to append a "\n" at the end :-)
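Putting those pieces together, a minimal sketch might look like this (Python 3 shown; append_line is a made-up helper name, and in your loop the text would come from raw_input / input):

```python
def append_line(path, text):
    # "a" opens in append mode, so the existing corpus is preserved;
    # write() adds no newline itself, so we append one explicitly.
    with open(path, "a") as f:
        f.write(text + "\n")

# Demo: start a file, then append to it twice without clobbering it.
with open("corpus_demo.txt", "w") as f:
    f.write("abc")

append_line("corpus_demo.txt", "def")
append_line("corpus_demo.txt", "ghi")
```

Each call adds to the end of the file instead of overwriting it, which is exactly the behavior mode "w" was destroying.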

Swedish characters in python error

I am making a program that uses words with Swedish characters and stores them in a list. I can print Swedish characters before I put them into a list, but after they are put in, they do not appear normally, just a big mess of characters.
Here is my code:
# coding=UTF-8
def get_word(lines, eng=0):
    if eng == 1:  #function to get word in english
        word_start = lines[1]

def do_format(word, lang):
    if lang == "sv":
        first_word = word
        second_word = translate(word, lang)
        element = first_word + " - " + second_word
    elif lang == "en":
        first_word = translate(word, lang)
        second_word = word
        element = first_word + " - " + second_word
    return element

def translate(word, lang):
    if lang == "sv":
        return "ENGLISH"
    if lang == "en":
        return "SWEDISH"

translated = []
path = "C:\Users\LK\Desktop\Dropbox\Dokumentai\School\Swedish\V47.txt"
doc = open(path, 'r')  #opens the document
doc_list = []  #the variable that will contain the list of words
for lines in doc.readlines():  #repeat as many times as there are lines
    if len(lines) > 1:  #ignore empty lines
        lines = lines.rstrip()  #don't add "\n" at the end
        doc_list.append(lines)  #add to the list

for i in doc_list:
    print i

for i in doc_list:
    if "-" in i:
        if i[0] == "-":
            element = do_format(i[2:], "en")
            translated.append(element)
        else:
            translated.append(i)
    else:
        element = do_format(i, "sv")
        translated.append(element)

print translated
raw_input()
I can reduce the problem to this simple code:
# -*- coding: utf-8 -*-
test_string = "ö"
test_list = ["å"]
print test_string, test_list
If I run that, I get this
ö ['\xc3\xa5']
There are multiple things to notice:
The broken character. This seems to happen because your Python outputs UTF-8 but your terminal seems to be configured in some ISO-8859-X mode (hence the two characters). I'd use proper unicode strings in Python 2 (always u"ö" instead of "ö") and check your locale settings (the locale command on Linux).
The weird string in the list. In Python, print e will print out str(e). For lists (such as ["å"]) the implementation of __str__ is the same as __repr__. And since repr(some_list) calls repr on each of the elements contained in the list, you end up with the string you see.
Example for repr(string):
>>> print u"ö"
ö
>>> print repr(u"ö")
u'\xf6'
>>> print repr("ö")
'\xc3\xb6'
If you print a list, it is printed as a structure. You should convert it to a string, for example by using the join() string method. With your test code it might look like:
print test_string, test_list
print('%s, %s, %s' % (test_string, test_list[0], ','.join(test_list)))
And output:
ö ['\xc3\xa5']
ö, å, å
I think in your main program you can:
print('%s' % (', '.join(translated)))
You can use the codecs module to specify the encoding used to decode the bytes being read.
import codecs
doc = codecs.open(path, 'r', encoding='utf-8')  #opens the document
Files opened with codecs.open give you unicode strings after decoding the raw bytes with the specified encoding.
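As a small illustration of that behaviour (the filename is made up; Python 3 shown, where the decoded type is simply str, while in Python 2 it would be a unicode object):

```python
import codecs

# Write raw UTF-8 bytes to a file, then read them back through
# codecs.open, which decodes them into a (unicode) string.
with open("swedish_words.txt", "wb") as f:
    f.write(u"ö - å".encode("utf-8"))

with codecs.open("swedish_words.txt", "r", encoding="utf-8") as f:
    text = f.read()

print(repr(text))
```

Because the decoding happens at read time, the rest of the program only ever sees proper unicode strings rather than raw bytes.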
In your code, prefix your string literals with u to make them unicode strings.
# -*- coding: utf-8 -*-
test_string = u"ö"
test_list = [u"å"]
print test_string, test_list[0]
