Decode from Escaped Unicode to Arabic using Python

I was trying to decode a JSON file that has escaped Unicode text (\uXXXX); the original text is Arabic.
My research led me to the following code, using Python:
s = '\u00d8\u00b5\u00d9\u0088\u00d8\u00b1 \u00d8\u00a7\u00d9\u0084\u00d9\u008a\u00d9\u0088\u00d9\u0085\u00d9\u008a\u00d8\u00a7\u00d8\u00aa'
ouy= s.encode('utf-8').decode('unicode-escape').encode('latin1').decode('utf-8')
print(ouy)
The resulting text is: صÙر اÙÙÙÙÙات
which still needs a fix using an online tool to become the original text: صور اليوميات
Is there any way to perform that fix using the above code?
Would appreciate your help, guys. Thanks in advance.

You can use this script to update your JSON files:
import json

filename = 'YourFile.json'  # file name we want to compress
newname = filename.replace('.json', '.min.json')  # output file name

with open(filename, encoding="utf8") as fp:
    print("Compressing file: " + filename)
    print('Compressing...')
    jload = json.load(fp)
    newfile = json.dumps(jload, indent=None, separators=(',', ':'), ensure_ascii=False)
    newfile = newfile.encode('latin1').decode('utf-8')  # fixes the mojibake; remove this line if your data is already correct
    #print(newfile)
    with open(newname, 'w', encoding="utf8") as f:  # note the encoding="utf8"
        f.write(newfile)
    print('Compression complete!')
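For reference, the key step is the latin1/UTF-8 round trip: the \u00XX escapes decode to Latin-1 characters whose code points are really UTF-8 bytes. A minimal sketch (not from the answer above) applied to the question's own string, with no JSON involved:
# Re-encode the mojibake characters as Latin-1 to recover the original
# UTF-8 bytes, then decode those bytes as UTF-8.
s = '\u00d8\u00b5\u00d9\u0088\u00d8\u00b1 \u00d8\u00a7\u00d9\u0084\u00d9\u008a\u00d9\u0088\u00d9\u0085\u00d9\u008a\u00d8\u00a7\u00d8\u00aa'
print(s.encode('latin1').decode('utf-8'))  # صور اليوميات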

Related

How to display cyrillic text from a file in python?

I want to read some Cyrillic text from a txt file in Python 3.
This is what the text file contains.
абцдефгчийклмнопярстувшхыз
I used:
with open('text.txt', 'r') as myfile:
    text = myfile.read()
print(text)
But this is the output in the Python shell:
ÿþ01F45D3G89:;<=>?O#ABC2HEK7
Can someone explain why this is the output?
Python supports utf-8 for this sort of thing.
You should be able to do:
with open('text.txt', encoding='utf-8', mode='r') as my_file:
    ...
Also, be sure that your text file is saved with utf-8 encoding. I tested this in my shell and without proper encoding my output was:
?????????????????????
With proper encoding:
file = open('text.txt', encoding='utf-8', mode='r')
text = file.read()
print(text)
абцдефгчийклмнопярстувшхы
Try working on the file using codecs. You need to
import codecs
and then do
text = codecs.open('text.txt', 'r', 'utf-8').read()
Basically, you need UTF-8.
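If it helps, here is a self-contained sketch of the same idea, assuming you control the file: save the Cyrillic text as UTF-8 first, then read it back with the same encoding.
# Write the sample text as UTF-8, then read it back with the matching
# encoding; 'text.txt' and the sample string come from the question above.
sample = 'абцдефгчийклмнопярстувшхыз'

with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(sample)

with open('text.txt', 'r', encoding='utf-8') as f:
    print(f.read())  # абцдефгчийклмнопярстувшхыз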

Do I have to encode unicode variable before write to file?

I read the "Unicdoe Pain" article days ago. And I keep the "Unicode Sandwich" in mind.
Now I have to handle some Chinese and I've got a list
chinese = [u'中文', u'你好']
Do i need to proceed encoding before writing to file?
add_line_break = [word + u'\n' for word in chinese]
encoded_chinese = [word.encode('utf-8') for word in add_line_break]
with open('filename', 'wb') as f:
f.writelines(encoded_chinese)
Somehow I find out that in python2. I can do this:
chinese = ['中文', '你好']
with open('filename', 'wb') as f:
f.writelines(chinese)
no unicode matter involed. :D
You don't have to do that; you can use io or codecs to open the file with an encoding.
import io

with io.open('file.txt', 'w', encoding='utf-8') as f:
    f.write(u'你好')
codecs.open has the same syntax.
In Python 3,
with open('file.txt', 'w', encoding='utf-8') as f:
    f.write('你好')
will do just fine.
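A minimal sketch tying this back to the question's list, assuming Python 3 (or io.open on Python 2) and a hypothetical output file chinese.txt: the strings are written as they are through a UTF-8 text-mode file, with no manual encode step.
chinese = [u'中文', u'你好']

# Text-mode file with an explicit encoding; the encoding happens inside the
# file object, so the Unicode strings can be written directly.
with open('chinese.txt', 'w', encoding='utf-8') as f:
    f.writelines(word + u'\n' for word in chinese)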

How do I read / write a file in Python (3) on Windows without introducing carriage returns?

I want to open a file using Python on Windows, perform some regex operations, optionally alter the content and then write the result back to a file.
I can create an example file which looks right (based on the comments on using binary mode in other posts on SO and within the documentation). What I can't see is how I convert the 'binary' data to a usable form without introducing '\r' characters.
An example:
import re

# Create an example file which represents the one I'm actually working on
# (a Jenkins config file if you're interested).
testFileName = 'testFile.txt'
with open(testFileName, 'wb') as output_file:
    output_file.write(b'this\nis\na\ntest')

# Try and read the file in as I would in the script I was trying to write.
content = ""
with open(testFileName, 'rb') as content_file:
    content = content_file.read()

# Do something to the content
exampleRegex = re.compile("a\\ntest")
content = exampleRegex.sub("a\\nworking\\ntest", content)  # <-- Fails because it won't operate on 'binary data'

# Write the file back to disk and then realise, frustratingly, that something
# in this process has introduced carriage returns onto every line.
outputFilename = 'output_' + testFileName
with open(outputFilename, 'wb') as output_file:
    output_file.write(content)
I presume you mean your text file has carriage returns and you don't want them included in the text.
If you use
with open(fileName, 'r', encoding="utf-8", errors="ignore", newline="\r\n") as content_file
or, more specifically, set newline="\r\n" in your open call, it should consume the carriage returns on new lines.
Edit: Or if you want to operate only on \n then this working example should do it.
import re

testFileName = 'testFile.txt'
with open(testFileName, 'w', newline='\n') as output_file:
    output_file.write('this\nis\na\ntest')

content = ""
with open(testFileName, 'r', newline='\n') as content_file:
    content = content_file.read()

exampleRegex = re.compile("a\\ntest")
content = exampleRegex.sub("a\\nworking\\ntest", content)

outputFilename = 'output_' + testFileName
with open(outputFilename, 'w', newline='\n') as output_file:
    output_file.write(content)
If I interpreted the question correctly, I first decoded the bytes to string, then did the regex sub. Next, I encoded the string into bytes to be written into the output file.
import re

testFileName = 'testFile.txt'
with open(testFileName, 'wb') as output_file:
    output_file.write(b'this\nis\na\ntest')

content = ""
with open(testFileName, 'rb') as content_file:
    content = content_file.read().decode('utf-8')

exampleRegex = re.compile("a\\ntest")
content = exampleRegex.sub("a\\nworking\\ntest", content)

outputFilename = 'output_' + testFileName
with open(outputFilename, 'wb') as output_file:
    output_file.write(content.encode('utf-8'))
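Another option, sketched here as an alternative (not from the original answers): keep everything as bytes and use a bytes pattern, which avoids newline translation entirely; this is also why the question's original regex failed, since a str pattern was applied to bytes content.
import re

# Operate on bytes end to end: binary files, bytes content, bytes pattern,
# so no newline translation can happen anywhere.
testFileName = 'testFile.txt'
with open(testFileName, 'rb') as content_file:
    content = content_file.read()

content = re.sub(b"a\ntest", b"a\nworking\ntest", content)

with open('output_' + testFileName, 'wb') as output_file:
    output_file.write(content)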

How to add encoding in python askopenfile

I was wondering, is there any way to pass an encoding right into the tkFileDialog.askopenfile call?
Right now I use askopenfilename instead and then use codecs to open the file with an encoding, like this:
def open():
    filename = askopenfilename(filetypes=[("Text files", "*.xml")])
    if filename == '':
        return
    with codecs.open(filename, encoding='utf-8') as f:
        txt = f.read()
    delete(xml)
    xml.insert("1.0", txt)
What I'm asking is how to do the same but using askopenfile, something like this:
with askopenfile(filetypes=[("Text files","*.xml")], encoding='utf-8') as f:
Any suggestions or other approaches to do the same are strongly appreciated.
Just merge the askopenfilename() into the with():
with codecs.open(askopenfilename(filetypes=[("Text files","*.xml")]), encoding='utf-8') as f:
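As a side note, on Python 3 the built-in open() accepts an encoding directly, so codecs is not required; a minimal sketch using askopenfilename from tkinter.filedialog:
from tkinter.filedialog import askopenfilename

# Ask for a file, then open it as UTF-8 text; askopenfilename returns ''
# when the dialog is cancelled.
filename = askopenfilename(filetypes=[("Text files", "*.xml")])
if filename:
    with open(filename, encoding='utf-8') as f:
        txt = f.read()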

Replace and overwrite instead of appending

I have the following code:
import re

# open the xml file for reading:
file = open('path/test.xml', 'r+')
# convert to string:
data = file.read()
file.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
file.close()
where I'd like to replace the old content that's in the file with the new content. However, when I execute my code, the file "test.xml" is appended to, i.e. I have the old content followed by the new "replaced" content. What can I do in order to delete the old stuff and only keep the new?
You need to seek to the beginning of the file before writing, and then use file.truncate() if you want to do an in-place replace:
import re

myfile = "path/test.xml"

with open(myfile, "r+") as f:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
The other way is to read the file then open it again with open(myfile, 'w'):
with open(myfile, "r") as f:
data = f.read()
with open(myfile, "w") as f:
f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
Neither truncate nor open(..., 'w') will change the inode number of the file (I tested twice, once with Ubuntu 12.04 NFS and once with ext4).
By the way, this is not really related to Python. The interpreter calls the corresponding low level API. The method truncate() works the same in the C programming language: See http://man7.org/linux/man-pages/man2/truncate.2.html
file = 'path/test.xml'

with open(file, 'w') as filetowrite:
    filetowrite.write('new content')
Open the file in 'w' mode; you will be able to replace its current text and save the file with the new contents.
Using truncate(), the solution could be:
import re

# open the xml file for reading and writing:
with open('path/test.xml', 'r+') as f:
    # convert to string:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
import os  # must import this library

if os.path.exists('TwitterDB.csv'):
    os.remove('TwitterDB.csv')  # this deletes the file
else:
    print("The file does not exist")  # add this to prevent errors
I had a similar problem, and instead of overwriting my existing file using the different 'modes', I just deleted the file before using it again, so that it would be as if I was appending to a new file on each run of my code.
See How to Replace String in File, which works in a simple way and is an answer that works with replace:
fin = open("data.txt", "rt")
fout = open("out.txt", "wt")
for line in fin:
fout.write(line.replace('pyton', 'python'))
fin.close()
fout.close()
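If the goal is to overwrite the original file, as in the question, one variation on the above is to write to a temporary file and then swap it in with os.replace; the temporary name here is just an illustration.
import os

# Write the replaced lines to a temporary file, then replace the original
# with it; "data.txt.tmp" is an arbitrary temporary name.
with open("data.txt", "rt") as fin, open("data.txt.tmp", "wt") as fout:
    for line in fin:
        fout.write(line.replace('pyton', 'python'))

os.replace("data.txt.tmp", "data.txt")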
In my case, the following code did the trick:
import json

# w+ mode creates the file if it does not exist and overwrites the existing content
with open("output.json", "w+") as outfile:
    json.dump(result_plot, outfile)
Using the Python 3 pathlib library:
import re
from pathlib import Path
import shutil
shutil.copy2("/tmp/test.xml", "/tmp/test.xml.bak") # create backup
filepath = Path("/tmp/test.xml")
content = filepath.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
Similar method, using a different approach to backups. Note that rename() moves the original file aside, so the content has to be read from the backup path before the new file is written at the original location:
import re
from pathlib import Path

filepath = Path("/tmp/test.xml")
backup = filepath.with_suffix('.bak')
filepath.rename(backup)  # different approach to backups: move the original aside
content = backup.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
