I know, this question has been asked a million times, but I am still stuck with it. I am using Python 2 and cannot change to Python 3.
The problem is this:
>>> w = u"ümlaut"
>>> w
u'\xfcmlaut'
>>> print w
ümlaut
>>> dic = {'key': w}
>>> dic
{'key': u'\xfcmlaut'}
>>> f = io.open('testtt.sql', mode='a', encoding='UTF8')
>>> f.write(u'%s' % dic)
The file then contains:
{'key': u'\xfcmlaut'}
I need {'key': 'ümlaut'} or {'key': u'ümlaut'}.
What am I missing? I am still a noob at encoding/decoding things :/
I'm not sure why you want this format in particular, since it won't be valid input to any other application, but never mind.
The problem is that writing a dictionary to a file requires converting it to a string, and to do that Python calls repr on each of its elements.
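You can see repr at work directly in the interpreter; a quick illustrative check:
>>> repr(w)
"u'\\xfcmlaut'"
That escaped form is exactly what ends up inside the dictionary's string representation.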
If you create the output manually as a string, all is well:
d = "{'key': '%s'}" % w
with io.open('testtt.sql', mode='a', encoding='UTF8') as f:
f.write(d)
The easiest solution is to switch to Python 3, but since you can't do that, how about converting the dictionary to JSON before saving it to the file?
import io
import json

w = u"ümlaut"
dic = {'key': w}

f = io.open('testtt.sql', mode='a', encoding='utf8')
# ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes;
# with a unicode value in the dict, dumps returns a unicode object that the
# encoding-aware file can write directly (no setdefaultencoding hack needed)
f.write(unicode(json.dumps(dic, ensure_ascii=False)))
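With ensure_ascii=False, the file should then contain {"key": "ümlaut"} (note that JSON uses double quotes) rather than the escaped form.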
I want to read a dictionary from a text file. The dictionary looks like {'key': [1, ord('#')]}. I read about eval() and literal_eval(), but neither of the two will work due to ord().
I also tried json.loads and json.dumps, but with no positive results.
What other way could I use to do this?
Assuming you read the text file in with open as a string, and not with json.loads, you could do some simple regex searching for what is between the parentheses of ord, e.g. ord('#') -> #.
This is a minimal solution that reads everything from the file as a single string, then finds all instances of ord and places the integer representation in an output list called ord_. For testing this example, myfile.txt was a text file with the following in it:
{"key": [1, "ord('#')"],
"key2": [1, "ord('K')"]}
import json
import re

# Read the whole file into a single string
with open(r"myfile.txt") as f:
    json_ = "".join([line.rstrip("\n") for line in f])

# Capture whatever appears between the parentheses of each ord(...)
rgx = re.compile(r"ord\(([^\)]+)\)")
rgd = rgx.findall(json_)

# Strip the quotes and convert each captured character to its code point
ord_ = [ord(str_.replace(r"'", "")) for str_ in rgd]
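For the sample file above, this leaves ord_ as [35, 75], the code points of '#' and 'K'.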
json.dump() and json.load() will not work because ord() is not JSON-serializable (a function call cannot be a JSON value).
Yes, eval is really bad practice; I would never recommend it to anyone for any use.
The best way I can think of to solve this is to use conditions and an extra list.
import json

# data.json contains {"key": [1, ["ord", "#"]]} -- the first element is the
# function name, the second is its argument
with open("data.json") as f:
    data = json.load(f)

# data['key'][1][0] is "ord"
if data['key'][1][0] == "ord":
    res = ord(data['key'][1][1])
I have a text file that has several thousand JSON objects (meaning the textual representation of JSON) one after the other. They're not separated, and I would prefer not to modify the source file. How can I load/parse each JSON in Python? (I have seen this question, but if I'm not mistaken, that only works for a list of JSONs already separated by a comma.) My file looks like this:
{"json":1}{"json":2}{"json":3}{"json":4}{"json":5}...
I don't see a clean way to do this without using the real JSON parser. The other options of modifying the text and using a non-JSON parser are risky. So the best way to go is to find a way to iterate using the real JSON parser, so that you're sure to comply with the JSON spec.
The core idea is to let the real JSON parser do all the work in identifying the groups:
import json
import re

combined = '{"json":1}{"json":2}{"json":3}{"json":4}{"json":5}'

start = 0
while start != len(combined):
    try:
        # If this succeeds, the remainder is a single valid object
        result = json.loads(combined[start:])
        end = len(combined)
    except ValueError as e:
        # The error message reports the column where the extra data begins
        end = start + int(re.search(r'column (\d+)', e.args[0]).group(1)) - 1
        result = json.loads(combined[start:end])
    start = end
    print(result)
This outputs:
{u'json': 1}
{u'json': 2}
{u'json': 3}
{u'json': 4}
{u'json': 5}
I think the following would work as long as the character sequence }{ never occurs inside any of the outermost JSON objects (for example, inside a string value). It's somewhat brute-force in that it reads the whole file into memory and attempts to fix it.
import json

def get_json_array(filename):
    with open(filename, 'rt') as jsonfile:
        # Turn "{...}{...}" into "[{...},{...}]" so it parses as one JSON array
        json_array = '[{}]'.format(jsonfile.read().replace('}{', '},{'))
    return json.loads(json_array)

for obj in get_json_array('multiobj.json'):
    print(obj)
Output:
{u'json': 1}
{u'json': 2}
{u'json': 3}
{u'json': 4}
{u'json': 5}
Instead of modifying the source file, just make a copy. Use a regex to replace }{ with },{ and then hopefully a pre-built JSON reader will take care of it nicely.
EDIT: quick solution:
from re import sub

with open(inputfile, 'r') as fin:
    text = sub(r'}{', r'},{', fin.read())
with open(outfile, 'w') as fout:
    fout.write('[')
    fout.write(text)
    fout.write(']')
>>> import ast
>>> s = '{"json":1}{"json":2}{"json":3}{"json":4}{"json":5}'
>>> [ast.literal_eval(ele + '}') for ele in s.split('}')[:-1]]
[{'json': 1}, {'json': 2}, {'json': 3}, {'json': 4}, {'json': 5}]
Provided you have no nested objects and splitting on '}' is feasible, this can be accomplished pretty simply.
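To see why the nesting caveat matters, here is what the naive split does to an object that contains another object:
>>> '{"a": {"b": 1}}'.split('}')
['{"a": {"b": 1', '', '']
None of those pieces plus a trailing '}' is a valid literal, so the approach only works on flat objects.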
Here is one pythonic way to do it:
from json.scanner import make_scanner
from json import JSONDecoder
def load_jsons(multi_json_str):
    s = multi_json_str.strip()
    scanner = make_scanner(JSONDecoder())
    idx = 0
    objects = []
    while idx < len(s):
        # scan_once returns the parsed object plus the index where it ended
        obj, idx = scanner(s, idx)
        objects.append(obj)
    return objects
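For the string from the question, this gives (as Python 2 prints it):
>>> load_jsons('{"json":1}{"json":2}{"json":3}')
[{u'json': 1}, {u'json': 2}, {u'json': 3}]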
I think json was never supposed to be used this way, but it solves your problem.
I agree with @Raymond Hettinger: you need to use json itself to do the work; text manipulation doesn't work for complex JSON objects. His answer parses the exception message to find the split position. It works, but it looks like a hack, hence, not pythonic :)
EDIT:
Just found out this is actually supported by the json module, just use raw_decode like this:
decoder = JSONDecoder()
first_obj, end_index = decoder.raw_decode(multi_json_str)
Note that the second value returned is the index where the first object ends, not the remaining string.
Read http://pymotw.com/2/json/index.html#mixed-data-streams
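A minimal sketch of how raw_decode could drive a loop over the whole string (iter_jsons is an illustrative name, not part of the json module):
from json import JSONDecoder

def iter_jsons(s):
    decoder = JSONDecoder()
    idx = 0
    while idx < len(s):
        # Parse one object starting at idx; raw_decode reports where it ended
        obj, idx = decoder.raw_decode(s, idx)
        yield obj

for obj in iter_jsons('{"json":1}{"json":2}{"json":3}'):
    print(obj)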
So I'm parsing a really big log file with some embedded JSON. I'll see lines like this:
foo="{my_object:foo, bar:baz}" a=b c=d
The problem is that the internal JSON can contain spaces, but outside of the JSON, spaces act as tuple delimiters (except where strings are unquoted; huzzah for whatever idiot thought that was a good idea). So I'm not sure how to figure out where the end of the JSON string is without reimplementing large portions of a JSON parser.
Is there a JSON parser for Python where I can give it '{"my_object":"foo", "bar":"baz"} asdfasdf' and it will return ({'my_object': 'foo', 'bar': 'baz'}, 'asdfasdf'), or am I going to have to reimplement the JSON parser by hand?
Found a really cool answer: use json.JSONDecoder's scan_once function.
In [30]: import json
In [31]: d = json.JSONDecoder()
In [32]: my_string = 'key="{"foo":"bar"}"more_gibberish'
In [33]: d.scan_once(my_string, 5)
Out[33]: ({u'foo': u'bar'}, 18)
In [37]: my_string[18:]
Out[37]: '"more_gibberish'
Just be careful
In [38]: d.scan_once(my_string, 6)
Out[38]: (u'foo', 11)
Match everything around it.
>>> re.search('^foo="(.*)" a=.+ c=.+$', 'foo="{my_object:foo, bar:baz}" a=b c=d').group(1)
'{my_object:foo, bar:baz}'
Use shlex and json.
Something like:
import shlex
import json

def decode_line(line):
    decoded = {}
    # shlex.split honors the shell-style quoting around the JSON value
    fields = shlex.split(line)
    for f in fields:
        k, v = f.split('=', 1)
        if k == "foo":
            v = json.loads(v)
        decoded[k] = v
    return decoded
This does assume that the JSON inside the quotes is quoted properly.
Here's a short example program that uses the above:
import pipes
testdict = {"hello": "world", "foo": "bar"}
line = 'foo=' + pipes.quote(json.dumps(testdict)) + ' a=b c=d'
print line
print decode_line(line)
With output:
foo='{"foo": "bar", "hello": "world"}' a=b c=d
{'a': 'b', 'c': 'd', 'foo': {u'foo': u'bar', u'hello': u'world'}}
I have created a Python OrderedDict by importing collections and stored it in a file named 'filename.txt'. The file content looks like
OrderedDict([(7, 0), (6, 1), (5, 2), (4, 3)])
I need to make use of this OrderedDict from another program. I do it as
myfile = open('filename.txt', 'r')
mydict = myfile.read()
I need 'mydict' to be of type
<class 'collections.OrderedDict'>
but here it comes out to be of type 'str'.
Is there any way in Python to convert a string to an OrderedDict? Using Python 2.7.
You could store and load it with pickle:
import cPickle as pickle

# store (pickle files should be opened in binary mode):
with open("filename.pickle", "wb") as fp:
    pickle.dump(ordered_dict, fp)

# read:
with open("filename.pickle", "rb") as fp:
    ordered_dict = pickle.load(fp)

type(ordered_dict)  # <class 'collections.OrderedDict'>
The best solution here is to store your data in a different way. Encode it into JSON, for example.
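For instance, a JSON round trip can preserve the ordering if you load with object_pairs_hook; this is a sketch, and filename.json is an illustrative name:
import json
from collections import OrderedDict

od = OrderedDict([(7, 0), (6, 1), (5, 2), (4, 3)])

# store as JSON (JSON object keys must be strings, so the int keys become "7", "6", ...)
with open("filename.json", "w") as fp:
    json.dump(od, fp)

# load, rebuilding an OrderedDict in the original key order
with open("filename.json") as fp:
    od2 = json.load(fp, object_pairs_hook=OrderedDict)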
You could also use the pickle module as explained in other answers, but this has potential security issues (similar to eval(), explained below), so only use that solution if you know that the data is always going to be trusted.
If you can't change the format of the data, then there are other solutions.
The really bad solution is to use eval(). This is a really, really bad idea as it's insecure: any code put in the file will be run, among other problems.
The better solution is to manually parse the file. The upside is that there is a way you can cheat at this and do it a little more easily. Python has ast.literal_eval(), which allows you to parse literals easily. While the file's content isn't a literal as it uses OrderedDict, we can extract the list literal and parse that.
E.g. (untested):
import re
import ast
import collections

with open("filename.txt") as file:
    line = next(file)

# Pull out the "[(7, 0), ...]" list literal between the parentheses
values = re.search(r"OrderedDict\((.*)\)", line).group(1)
mydict = collections.OrderedDict(ast.literal_eval(values))
This is not a good solution but it works. :)
#######################################
# String_To_OrderedDict
# Convert String to OrderedDict
# Example String
# txt = "OrderedDict([('width', '600'), ('height', '100'), ('left', '1250'), ('top', '980'), ('starttime', '4000'), ('stoptime', '8000'), ('startani', 'random'), ('zindex', '995'), ('type', 'text'), ('title', '#WXR##TU##Izmir##brief_txt#'), ('backgroundcolor', 'N'), ('borderstyle', 'solid'), ('bordercolor', 'N'), ('fontsize', '35'), ('fontfamily', 'Ubuntu Mono'), ('textalign', 'right'), ('color', '#c99a16')])"
#######################################
def string_to_ordereddict(txt):
    from collections import OrderedDict
    import re

    tempDict = OrderedDict()

    od_start = "OrderedDict(["
    od_end = "])"

    # Slice out the list of tuples between "OrderedDict([" and "])"
    first_index = txt.find(od_start)
    last_index = txt.rfind(od_end)
    new_txt = txt[first_index + len(od_start):last_index]

    # Capture each ('key', 'value') pair; [^']* also allows values
    # containing spaces, such as 'Ubuntu Mono'
    pattern = r"\('([^']*)',\s*'([^']*)'\)"
    for key, value in re.findall(pattern, new_txt):
        tempDict[key] = value

    return tempDict
Here's how I did it on Python 2.7:
from collections import OrderedDict
from ast import literal_eval

# Read in the string from the text file
myfile = open('filename.txt', 'r')
file_str = myfile.read()

# Remove the "OrderedDict([" prefix and "])" suffix by indexing
file_str = file_str[13:]
file_str = file_str[:-2]

# Convert the remaining "(7, 0), (6, 1), ..." into a sequence of tuples
file_list = literal_eval(file_str)

header = OrderedDict()
for entry in file_list:
    # Extract key and value from each tuple
    key, value = entry
    # Create entry in the OrderedDict
    header[key] = value
Again, you should probably write your text file differently.
I am completely new to Python and struggling to make a simple thing work.
I am reading a bunch of information from a web service, parsing the results, and I want to write it out to a flat file. Most of my items are single-line items, but one of the things I get back from my web service is a paragraph. The paragraph will contain newlines, quotes, and any random characters.
I was going to use the csv module for Python, but I am unsure of the parameters I should use and how to escape my string so the paragraph gets put onto a single line and I am guaranteed all characters are properly escaped (especially the delimiter).
The default csv.writer setup should handle this properly. Here's a simple example:
import csv
myparagraph = """
this is a long paragraph, with "quotes" and stuff.
"""
mycsv = csv.writer(open('foo.csv', 'wb'))
mycsv.writerow([myparagraph, 'word1'])
mycsv.writerow(['word2', 'word3'])
This yields the following csv file:
"
this is a long paragraph, with ""quotes"" and stuff.
",word1
word2,word3
This should load into your favorite CSV-opening tool with no problems, as having two rows and two columns.
You don't have to do anything special. The CSV module takes care of the quoting for you.
>>> import csv
>>> from StringIO import StringIO
>>> s = StringIO()
>>> w = csv.writer(s)
>>> w.writerow(['the\nquick\t\r\nbrown,fox\\', 32])
>>> s.getvalue()
'"the\nquick\t\r\nbrown,fox\\",32\r\n'
>>> s.seek(0)
>>> r = csv.reader(s)
>>> next(r)
['the\nquick\t\r\nbrown,fox\\', '32']
To help set your expectations, the following is executable pseudocode for how the quoting etc. works in the de-facto standard CSV output:
>>> def csv_output_record(input_row):
... delimiter = ','
... q = '"' # quotechar
... quotables = set([delimiter, '\r', '\n'])
... return delimiter.join(
... q + value.replace(q, q + q) + q if q in value
... else q + value + q if any(c in quotables for c in value)
... else value
... for value in input_row
... ) + '\r\n'
...
>>> csv_output_record(['foo', 'x,y,z', 'Jack "Ripper" Jones', 'top\nmid\nbot'])
'foo,"x,y,z","Jack ""Ripper"" Jones","top\nmid\nbot"\r\n'
Note that there is no escaping, only quoting, and hence if the quotechar appears in the field, it must be doubled.
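As a quick check that the reader undoes the doubling (in the same Python 2 session, with csv imported):
>>> next(csv.reader(['"Jack ""Ripper"" Jones","x,y,z"']))
['Jack "Ripper" Jones', 'x,y,z']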