I have been trying to replace integer components of a dictionary with string values given in another dictionary. However, I am getting the following error:
Traceback (most recent call last):
  File "<string>", line 11, in <module>
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 14 (char 13)
The code is given below. I am not sure why I am getting this error.
import re
from json import loads, dumps
movable = {"movable": [0, 3, 6, 9], "fixed": [1, 4, 7, 10], "mixed": [2, 5, 8, 11]}
int_mapping = {0: "Ar", 1: "Ta", 2: "Ge", 3: "Ca", 4: "Le", 5: "Vi", 6: "Li", 7: "Sc", 8: "Sa", 9: "Ca", 10: "Aq", 11: "Pi"}
movable = dumps(movable)
for key in int_mapping.keys():
    movable = re.sub('(?<![0-9])' + str(key) + '(?![0-9])', int_mapping[key], movable)
movable = loads(movable)
I understand that this code can easily be written in a different way to get the desired output. However, I am interested in understanding what I am doing wrong.
If you print what movable looks like right before calling json.loads, you'll see what the problem is:
for key in int_mapping.keys():
    movable = re.sub('(?<![0-9])' + str(key) + '(?![0-9])', int_mapping[key], movable)
print(movable)
outputs:
{"movable": [Ar, Ca, Li, Ca], "fixed": [Ta, Le, Sc, Aq], "mixed": [Ge, Vi, Sa, Pi]}
Those strings (Ar, Ca...) are not quoted, therefore it is not valid JSON.
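You can reproduce this directly; a quick illustration, not part of the original code:
>>> from json import loads
>>> loads('{"movable": [Ar, Ca, Li, Ca]}')
Traceback (most recent call last):
...
json.decoder.JSONDecodeError: Expecting value: line 1 column 14 (char 13)
char 13 is the A of Ar, which matches the column reported in the question's traceback.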
If you choose to continue the way you're going, you must add the double quotes yourself:
movable = re.sub(
    '(?<![0-9])' + str(key) + '(?![0-9])',
    '"' + int_mapping[key] + '"',
    movable)
(notice the '"' + int_mapping[key] + '"')
Which produces:
{"movable": ["Ar", "Ca", "Li", "Ca"], "fixed": ["Ta", "Le", "Sc", "Aq"], "mixed": ["Ge", "Vi", "Sa", "Pi"]}
This said... you are probably much better off just walking the movable values and substituting them with the values in int_mapping. Something like:
mapped_movable = {}
for key, val in movable.items():
    mapped_movable[key] = [int_mapping[v] for v in val]
print(mapped_movable)
You could use a dict comprehension and make the mapping replacements directly in Python:
...
movable = {
    k: [int_mapping[v] for v in values]
    for k, values in movable.items()
}
print(type(movable))
print(movable)
Out:
<class 'dict'>
{'movable': ['Ar', 'Ca', 'Li', 'Ca'], 'fixed': ['Ta', 'Le', 'Sc', 'Aq'], 'mixed': ['Ge', 'Vi', 'Sa', 'Pi']}
I have a simple program that has to delete some values that are between two given "days". For example, I have this list of dicts:
lst=[{"day": 1, "sum": 25, "type": 'in'}, {"day": 2, "sum": 55, "type": 'in'}, {"day": 3, "sum": 154, "type": 'out'}, {"day": 4, "sum": 99, "type": 'in'}]
and I want to delete the entries with "day" values between 1 and 3, so the output should be:
[{"day": 4, "sum": 99, "type": 'in'}]
Now I am using this program:
def delete_transaction_interval(all_transactions, dayStart, dayEnd):
    for element in enumerate(all_transactions):
        if get_transaction_day(all_transactions[element])>=dayStart and get_transaction_day(all_transactions[element])<=dayEnd:
            new_list_transactions=all_transactions[:]
    return new_list_transactions
but I want to use a getter function instead of all_transactions[i]["day"]. I already created the function:
def get_transaction_day(all_transactions):
    return all_transactions["day"]
but when I use it I get this error:
list indices must be integers or slices, not tuple
and I don't know how to handle it because I do not see any tuple in my code TBH.
My version is:
def delete_transaction_interval(all_transactions, dayStart, dayEnd):
    i=0
    while i<=len(all_transactions)-1:
        if get_transaction_day(all_transactions[i])>=dayStart and get_transaction_day(all_transactions[i])<=dayEnd:
            new_transactions_list=all_transactions[:]
        else:
            i+=1
    return new_transactions_list
Traceback:
Exception has occurred: TypeError
list indices must be integers or slices, not tuple
  File "<String>", line 81, in delete_transaction_interval
    if get_transaction_day(all_transactions[element])>=dayStart and get_transaction_day(all_transactions[element])<=dayEnd:
  File "<String>", line 229, in test_delete_interval
    delete_transaction_interval(all_transactions,1,3)
  File "<String>", line 276, in test_all
    test_delete_interval()
  File "<String>", line 281, in <module>
    test_all()
Can somebody help me with this, please?
Iterate over the list using a for loop if you want to get rid of the index:
new_transactions_list = []
for elem in lst:
    if not 1 <= elem["day"] <= 3:
        new_transactions_list.append(elem)
print(new_transactions_list)
Output
[{'day': 4, 'sum': 99, 'type': 'in'}]
If you are desperate to use this function:
def get_transaction_day(all_transactions):
    return all_transactions["day"]
I have no problem with that. Small functions are good.
It's just that you must pass in something that works with ["day"].
Here are some examples:
def delete_transaction_interval(all_transactions, dayStart, dayEnd):
    for element in enumerate(all_transactions):
        if get_transaction_day(element[1])>=dayStart and get_transaction_day(element[1])<=dayEnd:
            new_list_transactions=all_transactions[:]
    return new_list_transactions
and from the other answer here:
def delete_transaction_interval(all_transactions, dayStart, dayEnd):
    new_transactions_list = []
    for elem in all_transactions:
        if not dayStart <= get_transaction_day(elem) <= dayEnd:
            new_transactions_list.append(elem)
    return new_transactions
or a range() version:
def delete_transaction_interval(all_transactions, dayStart, dayEnd):
    for i in range(len(all_transactions)):
        if get_transaction_day(all_transactions[i])>=dayStart and get_transaction_day(all_transactions[i])<=dayEnd:
            new_transactions_list=all_transactions[:]
    return new_transactions_list
However, be warned, only one of these will work. The clue is to look in the original duplicate link which is here.
I'm writing JSON to a file using DataFrame.to_json() with the indent option:
df.to_json(path_or_buf=file_json, orient="records", lines=True, indent=2)
The important part here is indent=2, otherwise it works.
Then how do I read this file using DataFrame.read_json()?
I'm trying the code below, but it expects the file to have one JSON object per line, so the indentation messes things up:
df = pd.read_json(file_json, lines=True)
I didn't find any options in read_json to make it handle the indentation.
How else could I read this file created by to_json, possibly avoiding writing my own reader?
The combination of lines=True, orient='records', and indent=2 doesn't actually produce valid JSON.
lines=True is meant to create line-delimited JSON, but indent=2 adds extra lines. You can't have your delimiter be line breaks AND have extra line breaks!
If you use just orient='records' and indent=2, then it does produce valid JSON.
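So the simplest fix is to drop lines=True on both sides. A minimal sketch, using a toy frame rather than the OP's data:
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# write indented, records-oriented JSON (no lines=True)
df.to_json("file.json", orient="records", indent=2)

# read it back; without lines=True the indentation is handled fine
df2 = pd.read_json("file.json", orient="records")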
The current read_json(lines=True) code can be found here:
def _combine_lines(self, lines) -> str:
    """
    Combines a list of JSON objects into one JSON object.
    """
    return (
        f'[{",".join([line for line in (line.strip() for line in lines) if line])}]'
    )
You can see that it expects to read the file line by line, which isn't possible when indent has been used.
The other answer is good, but it turned out it requires reading the entire file into memory. I ended up writing a simple lazy parser, which I include below. It requires removing the lines=True argument in df.to_json.
The usage is as follows:
for obj, pos, length in lazy_read_json('file.json'):
    print(obj['field'])  # access json object
It yields pos, the start position of the object in the file, and length, the length of the object in the file; this allows some extra functionality for me, like being able to index objects and load them into memory on demand.
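For example, the on-demand loading looks roughly like this (a sketch of my own usage; it assumes the file was written with plain "\n" line endings, since pos and length count characters):
import json

# build the (pos, length) index once, then seek straight to any object
offsets = [(pos, length) for _, pos, length in lazy_read_json('file.json')]

with open('file.json') as fh:
    pos, length = offsets[2]  # say, the third object
    fh.seek(pos)
    obj = json.loads(fh.read(length))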
The parser is below:
import json

def lazy_read_json(filename: str):
    """
    :return generator returning (json_obj, pos, length)

    >>> test_objs = [{'a': 11, 'b': 22, 'c': {'abc': 'z', 'zzz': {}}}, \
            {'a': 31, 'b': 42, 'c': [{'abc': 'z', 'zzz': {}}]}, \
            {'a': 55, 'b': 66, 'c': [{'abc': 'z'}, {'z': 3}, {'y': 3}]}, \
            {'a': 71, 'b': 62, 'c': 63}]
    >>> json_str = json.dumps(test_objs, indent=4, sort_keys=True)
    >>> _create_file("/tmp/test.json", [json_str])
    >>> g = lazy_read_json("/tmp/test.json")
    >>> next(g)
    ({'a': 11, 'b': 22, 'c': {'abc': 'z', 'zzz': {}}}, 120, 116)
    >>> next(g)
    ({'a': 31, 'b': 42, 'c': [{'abc': 'z', 'zzz': {}}]}, 274, 152)
    >>> next(g)
    ({'a': 55, 'b': 66, 'c': [{'abc': 'z'}, {'z': 3}, {'y': 3}]}, 505, 229)
    >>> next(g)
    ({'a': 71, 'b': 62, 'c': 63}, 567, 62)
    >>> next(g)
    Traceback (most recent call last):
    ...
    StopIteration
    """
    with open(filename) as fh:
        state = 0
        json_str = ''
        cb_depth = 0  # curly brace depth
        line = fh.readline()
        while line:
            if line[-1] == "\n":
                line = line[:-1]
            line_strip = line.strip()
            if state == 0 and line == '[':
                state = 1
                pos = fh.tell()
            elif state == 1 and line_strip == '{':
                state = 2
                json_str += line + "\n"
            elif state == 2:
                if len(line_strip) > 0 and line_strip[-1] == '{':  # count nested objects
                    cb_depth += 1
                json_str += line + "\n"
                if cb_depth == 0 and (line_strip == '},' or line_strip == '}'):
                    # end of parsing an object
                    if json_str[-2:] == ",\n":
                        json_str = json_str[:-2]  # remove trailing comma
                    state = 1
                    obj = json.loads(json_str)
                    yield obj, pos, len(json_str)
                    pos = fh.tell()
                    json_str = ""
                elif line_strip == '}' or line_strip == '},':
                    cb_depth -= 1
            line = fh.readline()
# this function is for doctest
def _create_file(filename, lines):
    # cause doctest can't input new line characters :(
    f = open(filename, "w")
    for line in lines:
        f.write(line)
        f.write("\n")
    f.close()
I am trying to decode a Huffman-encoded string with the following script:
def decode(input_string, lookup_table):
    codes = lookup_table.keys()
    result = bytearray()
    i = 0
    while len(input_string[i:]) != 0:
        for code in codes:
            if input_string[i:].startswith(code):
                result.append(lookup_table[code])
                i += len(code)
                break
    return ''.join((chr(x) for x in result))

lookup_table = {'10': 108, '00': 111, '0111': 114, '010': 119, '0110': 72, '1111': 33, '1110': 32, '1100': 100, '1101': 101}
input_string = '0110110110100011100100001111011001111'
print(decode(input_string, lookup_table))
This script gives as output:
'Hello world!'
The script works for this small string, but decoding an encoded version of Hamlet takes 110 seconds. Is there a more efficient and faster way to do this?
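The quadratic cost comes from input_string[i:], which copies the rest of the string on every iteration, and from scanning all codes at each step. A common faster approach, sketched below rather than taken from the original post, is to consume one bit at a time and test the growing buffer against the table; this is valid because Huffman codes are prefix-free:
def decode_fast(input_string, lookup_table):
    result = bytearray()
    buffer = ""
    for bit in input_string:
        buffer += bit
        code = lookup_table.get(buffer)
        if code is not None:  # prefix-free codes: the first match is the right one
            result.append(code)
            buffer = ""
    return ''.join(chr(x) for x in result)
This is linear in the length of the bit string instead of quadratic.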
I have a file where each line contains text like this (representing the cast of a film):
[{'cast_id': 23, 'character': "Roger 'Verbal' Kint", 'credit_id': '52fe4260c3a36847f8019af7', 'gender': 2, 'id': 1979, 'name': 'Kevin Spacey', 'order': 5, 'profile_path': '/x7wF050iuCASefLLG75s2uDPFUu.jpg'}, {'cast_id': 27, 'character': 'Edie's Finneran', 'credit_id': '52fe4260c3a36847f8019b07', 'gender': 1, 'id': 2179, 'name': 'Suzy Amis', 'order': 6, 'profile_path': '/b1pjkncyLuBtMUmqD1MztD2SG80.jpg'}]
I need to convert it into a valid JSON string, converting only the necessary single quotes to double quotes (e.g. the single quotes around the word Verbal must not be converted, and any apostrophes in the text should not be converted either).
I am using Python 3.x. I need to find a regular expression which will convert only the right single quotes to double quotes, so that the whole text results in a valid JSON string. Any ideas?
First of all, the line you gave as an example is not parsable! … 'Edie's Finneran' … contains a syntax error, no matter what.
Assuming that you have control over the input, you could simply use eval() to read in the file. (Although, in that case one would wonder why you can't produce valid JSON in the first place…)
>>> f = open('list.txt', 'r')
>>> s = f.read().strip()
>>> l = eval(s)
>>> import pprint
>>> pprint.pprint(l)
[{'cast_id': 23,
  'character': "Roger 'Verbal' Kint",
  ...
  'profile_path': '/b1pjkncyLuBtMUmqD1MztD2SG80.jpg'}]
>>> import json
>>> json.dumps(l)
'[{"cast_id": 23, "character": "Roger \'Verbal\' Kint", "credit_id": "52fe4260ca36847f8019af7", "gender": 2, "id": 1979, "name": "Kevin Spacey", "order": 5, "rofile_path": "/x7wF050iuCASefLLG75s2uDPFUu.jpg"}, {"cast_id": 27, "character":"Edie\'s Finneran", "credit_id": "52fe4260c3a36847f8019b07", "gender": 1, "id":2179, "name": "Suzy Amis", "order": 6, "profile_path": "/b1pjkncyLuBtMUmqD1MztDSG80.jpg"}]'
If you don't have control over the input, this is very dangerous, as it opens you up to code injection attacks.
I cannot emphasize enough that the best solution would be to produce valid JSON in the first place.
If you do not have control over the JSON data, do not eval() it!
I created a simple JSON correction mechanism, as that is more secure:
def correctSingleQuoteJSON(s):
    rstr = ""
    escaped = False
    for c in s:
        if c == "'" and not escaped:
            c = '"'  # replace single with double quote
        elif c == "'" and escaped:
            rstr = rstr[:-1]  # remove escape character before single quotes
        elif c == '"':
            c = '\\' + c  # escape existing double quotes
        escaped = (c == "\\")  # check for an escape character
        rstr += c  # append the correct json
    return rstr
You can use the function in the following way:
import json
singleQuoteJson = "[{'cast_id': 23, 'character': 'Roger \\'Verbal\\' Kint', 'credit_id': '52fe4260c3a36847f8019af7', 'gender': 2, 'id': 1979, 'name': 'Kevin Spacey', 'order': 5, 'profile_path': '/x7wF050iuCASefLLG75s2uDPFUu.jpg'}, {'cast_id': 27, 'character': 'Edie\\'s Finneran', 'credit_id': '52fe4260c3a36847f8019b07', 'gender': 1, 'id': 2179, 'name': 'Suzy Amis', 'order': 6, 'profile_path': '/b1pjkncyLuBtMUmqD1MztD2SG80.jpg'}]"
correctJson = correctSingleQuoteJSON(singleQuoteJson)
print(json.loads(correctJson))
Here is the code to get the desired output:
import ast

def getJson(filepath):
    fr = open(filepath, 'r')
    lines = []
    for line in fr.readlines():
        line_split = line.split(",")
        set_line_split = []
        for i in line_split:
            i_split = i.split(":")
            i_set_split = []
            for split_i in i_split:
                set_split_i = ""
                rev = ""
                i = 0
                for ch in split_i:
                    if ch in ['\"','\'']:
                        set_split_i += ch
                        i += 1
                        break
                    else:
                        set_split_i += ch
                        i += 1
                i_rev = (split_i[i:])[::-1]
                state = False
                for ch in i_rev:
                    if ch in ['\"','\''] and state == False:
                        rev += ch
                        state = True
                    elif ch in ['\"','\''] and state == True:
                        rev += ch+"\\"
                    else:
                        rev += ch
                i_rev = rev[::-1]
                set_split_i += i_rev
                i_set_split.append(set_split_i)
            set_line_split.append(":".join(i_set_split))
        line_modified = ",".join(set_line_split)
        lines.append(ast.literal_eval(str(line_modified)))
    return lines

lines = getJson('test.txt')
for i in lines:
    print(i)
Apart from eval() (mentioned in user3850's answer), you can use ast.literal_eval
This has been discussed in the thread: Using python's eval() vs. ast.literal_eval()?
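A minimal sketch of that approach, with cast.txt as a hypothetical filename (note it will still choke on lines with real syntax errors, like the 'Edie's Finneran' one above):
from ast import literal_eval

with open('cast.txt') as fh:
    for line in fh:
        cast = literal_eval(line)  # parses Python literals only; no arbitrary code execution
        print(cast[0]['name'])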
You can also look at the following discussion threads from Kaggle competition which has data similar to the one mentioned by OP:
https://www.kaggle.com/c/tmdb-box-office-prediction/discussion/89313#latest-517927
https://www.kaggle.com/c/tmdb-box-office-prediction/discussion/80045#latest-518338
I got a bug in a huge database with thousands of lines, so I tried removing thousands of them (I obviously kept a backup file) to isolate the problem, and I ended up with a problem. I'll provide a line from the database that works without problems, in case you want to compare:
#Working
["#saelyth", 8, 40, 4, "000", "000", "0", 11, "!Bot, me lees?", "legionanimenet", 0, "primermensajitodeesapersona"]
#Not working
["!anon7002", 545, 3166, 7, "000", "000", "0", 13, "\u2014\u00a1 Hijo! Estas calificaciones merecen una golpiza. \u2014\u00bf Verdad que si mam ...Vamos que yo se donde vive la maestra.", "legionanimenet", 0, "primermensajitodeesapersona"]
Causing this error:
ValueError: Extra data: line 1 column 240 - line 2 column 1 (char 239 - 366)
My question is: what is wrong there? I have no idea, and all my efforts to find out why json gives me this error have been unsuccessful.
So I completely deleted that line and tried to load the full database without it, but now I get a new error:
ValueError: Expecting ',' delimiter: line 1 column 62 (char 61)
With so many, many, many records like:
["tyjyu", 59, 302, 19, "000", "000", "0", 13, "holas", "legionanimenet", 0, "primermensajitodeesapersona"]
["inuyacha64", 15944, 79401, 3496, "000", "F00", "0", 16, "cuidence chau", "legionanimenet", 2, "primermensajitodeesapersona"]
["!anon3573", 24, 140, "1", "nada", "nada", "nada", "nada", "nada", "legionanimenet", 0, "primermensajitodeesapersona"]
["eldiaoscuro", 74, 446, 16, "603", "369", "4", 13, "nada", "legionanimenet", 0, "primermensajitodeesapersona"]
What would be an efficient way to FIND the missing , that gives me that error? And if possible, I'd also like to know whether json has a maximum number of items it can load, or anything like that.
EDIT
The code to load info is:
data = []
with open('listas\Estadisticas.txt', 'r+') as f:
    for line in f:
        data_line = json.loads(line)
        if data_line[0] == user.name and data_line[9] == "legionanimenet":
            data_line[1] = int(data_line[1])+int(palabrasdelafrase)
            data_line[2] = int(data_line[2])+int(letrasdelafrase)
            data_line[3] = int(data_line[3])+1
            data_line[4] = user.nameColor
            data_line[5] = user.fontColor
            data_line[6] = user.fontFace
            data_line[7] = user.fontSize
            data_line[11] = data_line[8]
            data_line[8] = message.body
            data_line[9] = "legionanimenet"
        data.append(data_line)
    f.seek(0)
    f.writelines(["%s\n" % json.dumps(i) for i in data])
    f.truncate()
I hope someone can help me with this.
EDIT2: Python version is 3.3.2 IDLE
Printing each line with print(repr(line)) before the json.loads(line) call will show every record up to the one that triggers the error. I thank @nneonneo for that.
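A variant of the same idea, as a sketch against the file layout from the question: catch the exception so you get the exact line number and content of the first bad record.
import json

with open('listas\Estadisticas.txt') as f:
    for lineno, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except ValueError as e:  # json errors are ValueErrors (JSONDecodeError in newer Pythons)
            print("Bad record on line %d: %r (%s)" % (lineno, line, e))
            break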