How to split a line containing Unicode escapes (\u9078) in Python

When converting the properties file to JSON, an extra backslash is added to the Unicode escape sequences. How can I avoid this? See the code below.
Input File (sample.properties)
property.key.CHOOSE=\u9078\u629e
Code
import json

def convertPropertiesToJson(fileName, outputFileName, sep='=', comment_char='#'):
    props = {}
    with open(fileName, "r") as f:
        for line in f:
            l = line.strip()
            if l and not l.startswith(comment_char):
                innerProps = {}
                keyValueList = l.split(sep)
                key = keyValueList[0].strip()
                keyList = key.split('.')
                value = sep.join(keyValueList[1:]).strip()
                if keyList[1] not in props:
                    props[keyList[1]] = {}
                innerProps[keyList[2]] = value
                props[keyList[1]].update(innerProps)
    with open(outputFileName, 'w') as outfile:
        json.dump(props, outfile)

convertPropertiesToJson("sample.properties", "sample.json")
Output: (sample.json)
{"key": {"CHOOSE": "\\u9078\\u629e"}}
Expected Result:
{"key": {"CHOOSE": "\u9078\u629e"}}

The problem is that the input is read as-is, so \u is copied literally as two characters. The easiest fix is probably this:
with open(fileName, "r", encoding='unicode-escape') as f:
This will decode the escaped Unicode characters.
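Applied to the converter above, only the open() call needs to change; json.dump (with its default ensure_ascii=True) then writes the character back out as a single \u escape. A minimal sketch of the relevant part, assuming the rest of the function stays the same:
with open(fileName, "r", encoding="unicode-escape") as f:
    for line in f:
        l = line.strip()
        # ... same parsing as before ...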

I don't know solution to your problem but I found out where problem occurs.
with open('sample.properties', encoding='utf-8') as f:
    for line in f:
        print(line)
        print(repr(line))
        d = {}
        d['line'] = line
        print(d)
out:
property.key.CHOOSE=\u9078\u629e
'property.key.CHOOSE=\\u9078\\u629e'
{'line': 'property.key.CHOOSE=\\u9078\\u629e'}
Adding the string to a dictionary doesn't change it; printing the dict just shows the repr() of its values, which is why the backslash appears doubled. The file itself contains a literal backslash followed by u.

The problem seems to be that your file stores the Unicode characters as escaped sequences (a literal backslash followed by uXXXX), so you need to decode them at some point.
Changing
l = line.strip()
to (for Python 2.x)
l = line.strip().decode('unicode-escape')
or (for Python 3.x)
l = line.strip().encode('ascii').decode('unicode-escape')
gives the desired output:
{"key": {"CHOOSE": "\u9078\u629e"}}

Related

Txt to list in Python

I have an issue in Python.
I have this text file:
250449825306
331991628894
294132934371
334903836165
I want to read it into a list of ints in Python, for example A = [x1, x2, x3].
I have written this:
f = open('your_file.txt', encoding = "utf8")
z = f.readlines()
print(z)
But z ends up as a list of strings instead of ints.
I am kinda stuck, is there anything I can try here?
Thanks in advance!
Try this:
with open("file.txt", "r") as file:
_list = list(map(int, file.read().split("\n")))
This read the file, splits it based on the "\n" char, and makes all of ints.
Because the file is read as one big string, you'll have newline characters inside it as well; in the text file I'd advise not having blank lines. To split this properly, instead of readlines(), read the entire file and split it at "\n":
f = open('file.txt', encoding="utf8")
z = f.read().split("\n")
print(z)
You'll then have to convert the values to int. This can be done with map(), but in case of any trailing newlines a try/except is safer:
f = open('file.txt', encoding="utf8")
z = f.read().split("\n")
l = []
for x in z:
    try:
        l.append(int(x))
    except ValueError:
        pass
print(l)
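An equivalent, slightly more compact version (a sketch, assuming the file is named file.txt) uses splitlines() and skips blank lines, which avoids the try/except around trailing newlines:
with open('file.txt', encoding='utf8') as f:
    # splitlines() drops the newline characters; the filter skips blank lines.
    numbers = [int(line) for line in f.read().splitlines() if line.strip()]
print(numbers)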

How to save escape sequences to a file in python without double backslashes?

I want to save some MathJax code to a .txt file in Python.
x = "$\infty$"
with open("sampletext.txt", "a+") as f:
f.write(x)
Works exactly as expected
sampletext.txt
$\infty$
However, when I try to save the string inside a list:
x = ["$\infty$"]
with open("sampletext.txt", "a+") as f :
f.write(str(x))
sampletext.txt
['$\\infty$']
How do I remove the double backslash in the latter case and save it as ['$\infty$']?
Try this:
x = [r"$\infty$"]
with open("sampletext.txt", "a+") as f:
f.write(str(x))
The r prefix marks the string as a raw string, so backslashes are not treated as escape characters.
Maybe this can help you:
x = [r"$\infty$"]
with open("sampletext.txt", "a+") as f:
f.write(''.join(x))
Flag "r" (raw) can be use to save string with special symbols like "\"
Or if you don't know how many items in the list:
x = ["$\infty$"]
with open("sampletext.txt", "a+") as f:
f.write(f"{''.join(x)}")

Json find key contain some char and write key and value to new json file with python

I need to find the keys that contain a dash, and write those keys and their values to a new JSON file.
this is my code:
#coding=utf-8
import os
import sys
import json
import fileinput

file_path = sys.argv[1]
file = open(file_path, 'r')
content = file.read()
dict = json.loads(content, encoding="utf-8")
output = "{"
for key in dict:
    if key.find("-") != -1:
        output = output + "%s: %s" % (key, unicode(dict[key]).encode('utf8'))
        print output
output = output + "}"
output = json.dumps(json.loads(output, encoding="utf-8"), indent=4, separators=(', ', ': '), ensure_ascii=False).encode('utf8')
file_name = os.path.basename(file_path)
sort_file = open(file_name, 'a')
sort_file.write(output)
sort_file.close()
The output file looks something like:
u'login': u".//input[#placeholder='Email/ \u624b\u6a5f\u865f\u78bc/
Is there any way to write content_dict[key] as UTF-8 characters rather than escapes like "\u78bc"?
And is there a good way to find keys containing a given character and write them to a new JSON file?
You are using Python 2, and want to be able to read and write json files that contain non-ascii characters.
The easiest way to do this is to work with unicode only: do the file I/O in binary mode, decode the raw bytes to unicode and parse the JSON when reading, and encode the JSON string back to bytes before writing.
The code should look something like this:
file_path = sys.argv[1]

# Read data as bytes
with open(file_path, 'rb') as f:
    raw_data = f.read()

# Decode bytes to unicode, then convert from json.
dict_ = json.loads(raw_data.decode('utf-8'))

output = {}
for key, value in dict_.iteritems():
    # Using the in operator is the Pythonic way to check
    # if a character is in a string.
    if "-" in key:
        output[key] = value
print output

file_name = os.path.basename(file_path)
with open(file_name, 'ab') as f:
    j = json.dumps(output, indent=4, separators=(', ', ': '), ensure_ascii=False)
    # Encode json unicode string before writing to file.
    f.write(j.encode('utf-8'))
In this code I've used the with statement to handle closing open files automatically.
I have also collected the data to be written in a dictionary rather than in a string. Building json strings manually can often be a cause of errors.
Switching to Python 3 would remove the need for separate encoding and conversion steps and generally simplify handling non-ascii data.
A Pythonic way (tested with Python 2.7) to filter the original dict is:
d1 = {'x-y': 3, 'ft': 9, 't-b': 7}
d2 = {k: v for k, v in d1.iteritems() if '-' in k}
print(d2)
Output
{'t-b': 7, 'x-y': 3}
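For comparison, a minimal Python 3 sketch of the same filtering (the file handling mirrors the question; the paths are just examples):
import json
import os
import sys

file_path = sys.argv[1]

# Python 3 text-mode I/O handles the decoding for us.
with open(file_path, encoding='utf-8') as f:
    data = json.load(f)

# Keep only the keys that contain a dash.
filtered = {k: v for k, v in data.items() if '-' in k}

out_path = os.path.basename(file_path)
with open(out_path, 'w', encoding='utf-8') as f:
    # ensure_ascii=False writes the characters themselves instead of \uXXXX escapes.
    json.dump(filtered, f, indent=4, ensure_ascii=False)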

How can I decode a string read from file?

I read a file into a string in Python, and it shows up as encoded (I'm not sure which encoding).
query = ""
with open(file_path) as f:
for line in f.readlines():
print(line)
query += line
query
The lines all print out in English as expected
select * from table
but the query at the end shows up like
ÿþd\x00r\x00o\x00p\x00 \x00t\x00a\x00b\x00l\x00e\x00
What's going on?
Agreed with Carlos, the encoding seems to be UTF-16LE. There seems to be a BOM present, so encoding="utf-16" will autodetect whether it's little- or big-endian.
Idiomatic Python would be:
with open(file_path, encoding="...") as f:
    for line in f:
        # do something with this line
In your case you append each line to query, so the entire code can be reduced to:
query = open(file_path, encoding="...").read()
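For example, a minimal sketch assuming the file really is UTF-16 with a BOM:
with open(file_path, encoding="utf-16") as f:
    query = f.read()
print(query)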
It seems like UTF-16 data.
Can you try decoding it with utf-16?
with open(file_path, 'rb') as f:
    query = f.read().decode('utf-16')
print(query)
with open(filePath) as f:
    fileContents = f.read()
if isinstance(fileContents, str):
    fileContents = fileContents.decode('ascii', 'ignore').encode('ascii')  # note: this removes the character and encodes back to string.
elif isinstance(fileContents, unicode):
    fileContents = fileContents.encode('ascii', 'ignore')

python write umlauts into file

I have the following list, which I want to write into a file:
l = ["Bücher", "Hefte", "Mappen"]
I do it like this:
f = codecs.open("testfile.txt", "a", stdout_encoding)
f.write(l)
f.close()
In my text file I want to see ["Bücher", "Hefte", "Mappen"] instead of B\xc3\xbccher.
Is there any way to do this without looping over the list and decoding each item? For example, by passing a parameter to write()?
Many thanks
First, make sure you use unicode strings: add the "u" prefix to strings:
l = [u"Bücher", u"Hefte", u"Mappen"]
Then you can write or append to a file:
I recommend using the io module, which is Python 2/3 compatible.
with io.open("testfile.txt", mode="a", encoding="UTF8") as fd:
for line in l:
fd.write(line + "\n")
To read your text file in one piece:
with io.open("testfile.txt", mode="r", encoding="UTF8") as fd:
content = fd.read()
The resulting content is a Unicode string.
If you encode this string using UTF8, you'll get a byte string like this:
b"B\xc3\xbccher"
Edit: using writelines().
The method writelines() writes a sequence of strings to the file. The sequence can be any iterable object producing strings, typically a list of strings. There is no return value.
# add new lines
lines = [line + "\n" for line in l]
with io.open("testfile.txt", mode="a", encoding="UTF8") as fd:
    fd.writelines(lines)
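In Python 3 the built-in open() accepts an encoding parameter and every str is already Unicode, so the same idea reduces to this minimal sketch:
l = ["Bücher", "Hefte", "Mappen"]
with open("testfile.txt", "a", encoding="utf-8") as fd:
    # Python 3 repr() leaves the umlauts readable, so str(l) looks like the list above.
    fd.write(str(l) + "\n")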
