python json database corrupted

I hit a bug in a huge database with thousands of lines, so I tried removing thousands of them (I have a backup file, obviously) to isolate the problem, and I ended up with a new one. I'll provide a record that loads without problems alongside one that doesn't, in case you want to compare:
#Working
["#saelyth", 8, 40, 4, "000", "000", "0", 11, "!Bot, me lees?", "legionanimenet", 0, "primermensajitodeesapersona"]
#Not working
["!anon7002", 545, 3166, 7, "000", "000", "0", 13, "\u2014\u00a1 Hijo! Estas calificaciones merecen una golpiza. \u2014\u00bf Verdad que si mam ...Vamos que yo se donde vive la maestra.", "legionanimenet", 0, "primermensajitodeesapersona"]
Causing this error:
ValueError: Extra data: line 1 column 240 - line 2 column 1 (char 239 - 366)
My question is: what is wrong there? I have no idea, and all my efforts to find out why json gives me that error have been unsuccessful.
So I completely deleted that line and tried to load the full database without it, but now I get a new error:
ValueError: Expecting ',' delimiter: line 1 column 62 (char 61)
With many, many, many records like:
["tyjyu", 59, 302, 19, "000", "000", "0", 13, "holas", "legionanimenet", 0, "primermensajitodeesapersona"]
["inuyacha64", 15944, 79401, 3496, "000", "F00", "0", 16, "cuidence chau", "legionanimenet", 2, "primermensajitodeesapersona"]
["!anon3573", 24, 140, "1", "nada", "nada", "nada", "nada", "nada", "legionanimenet", 0, "primermensajitodeesapersona"]
["eldiaoscuro", 74, 446, 16, "603", "369", "4", 13, "nada", "legionanimenet", 0, "primermensajitodeesapersona"]
What would be an efficient way to FIND the missing ',' that causes this error? And if possible, I'd also like to know whether json has a maximum number of items it can load, or anything like that.
EDIT
The code to load info is:
import json

data = []
with open(r'listas\Estadisticas.txt', 'r+') as f:
    for line in f:
        data_line = json.loads(line)
        if data_line[0] == user.name and data_line[9] == "legionanimenet":
            data_line[1] = int(data_line[1]) + int(palabrasdelafrase)
            data_line[2] = int(data_line[2]) + int(letrasdelafrase)
            data_line[3] = int(data_line[3]) + 1
            data_line[4] = user.nameColor
            data_line[5] = user.fontColor
            data_line[6] = user.fontFace
            data_line[7] = user.fontSize
            data_line[11] = data_line[8]
            data_line[8] = message.body
            data_line[9] = "legionanimenet"
        data.append(data_line)
    f.seek(0)
    f.writelines(["%s\n" % json.dumps(i) for i in data])
    f.truncate()
I hope someone can help me with this.
EDIT2: Python version is 3.3.2 (IDLE)

print(repr(data_line))
added inside the loading loop will print each record UNTIL the error is found, so the last record printed sits right before the broken line. I thank #nneonneo for that.
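For what it's worth, 'Extra data' means the parser found a complete JSON value and then more text after it on the same line (which suggests something extra, perhaps a second record, ended up on that line), while 'Expecting , delimiter' points at a record that got truncated or mangled during the hand-editing. Along the lines of the tip above, a minimal sketch (assuming the same one-record-per-line layout) that reports the first malformed line:

import json

# Scan the database line by line and stop at the first record that fails to parse.
with open(r'listas\Estadisticas.txt', 'r') as f:
    for lineno, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except ValueError as e:  # json raises ValueError (JSONDecodeError in newer Pythons)
            print("Bad record on line %d: %s" % (lineno, e))
            print(repr(line))
            break

And no, json has no fixed maximum number of items it can load; the practical limit is memory.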

Related

Replacing elements of a Python dictionary using regex

I have been trying to replace integer components of a dictionary with string values given in another dictionary. However, I am getting the following error:
Traceback (most recent call last):
  File "<string>", line 11, in <module>
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 14 (char 13)
The code has been given below. Not sure why I am getting an error.
import re
from json import loads, dumps

movable = {"movable": [0, 3, 6, 9], "fixed": [1, 4, 7, 10], "mixed": [2, 5, 8, 11]}
int_mapping = {0: "Ar", 1: "Ta", 2: "Ge", 3: "Ca", 4: "Le", 5: "Vi", 6: "Li", 7: "Sc", 8: "Sa", 9: "Ca", 10: "Aq", 11: "Pi"}
movable = dumps(movable)
for key in int_mapping.keys():
    movable = re.sub('(?<![0-9])' + str(key) + '(?![0-9])', int_mapping[key], movable)
movable = loads(movable)
I understand that this code can easily be written in a different way to get the desired output. However, I am interested to understand what I am doing wrong.
If you print movable right before calling json.loads, you'll see what the problem is:
for key in int_mapping.keys():
    movable = re.sub('(?<![0-9])' + str(key) + '(?![0-9])', int_mapping[key], movable)
print(movable)
outputs:
{"movable": [Ar, Ca, Li, Ca], "fixed": [Ta, Le, Sc, Aq], "mixed": [Ge, Vi, Sa, Pi]}
Those strings (Ar, Ca...) are not quoted, therefore it is not valid JSON.
If you choose to continue the way you're going, you must add the quotes:
movable = re.sub(
    '(?<![0-9])' + str(key) + '(?![0-9])',
    '"' + int_mapping[key] + '"',
    movable)
(notice the '"' + int_mapping[key] + '"')
Which produces:
{"movable": ["Ar", "Ca", "Li", "Ca"], "fixed": ["Ta", "Le", "Sc", "Aq"], "mixed": ["Ge", "Vi", "Sa", "Pi"]}
That said... you are probably much better off just walking the movable values and substituting them with the values in int_mapping. Something like:
mapped_movable = {}
for key, val in movable.items():
    mapped_movable[key] = [int_mapping[v] for v in val]
print(mapped_movable)
You could use a dict comprehension and make the mapping replacements directly in Python:
...
movable = {
    k: [int_mapping[v] for v in values]
    for k, values in movable.items()
}
print(type(movable))
print(movable)
Out:
<class 'dict'>
{'movable': ['Ar', 'Ca', 'Li', 'Ca'], 'fixed': ['Ta', 'Le', 'Sc', 'Aq'], 'mixed': ['Ge', 'Vi', 'Sa', 'Pi']}

boto3 glue get_job_runs - check execution with certain date exists in the response object

I am trying to fetch Glue job executions that failed the previous day, using the 'get_job_runs' function available through boto3's Glue client.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_job_runs.
The request syntax does not have an option to filter executions or job runs by date or status:
response = client.get_job_runs(
    JobName='string',
    NextToken='string',
    MaxResults=123
)
The response I receive back looks something like this:
{
    "JobRuns": [
        {
            "Id": "jr_89bfa55b544f7eec4f6ea574dfb0345345uhi4df65e59869e93c5d8f5efef989",
            "Attempt": 0,
            "JobName": "GlueJobName",
            "StartedOn": datetime.datetime(2021, 1, 27, 4, 32, 47, 718000, tzinfo=tzlocal()),
            "LastModifiedOn": datetime.datetime(2021, 1, 27, 4, 36, 14, 975000, tzinfo=tzlocal()),
            "CompletedOn": datetime.datetime(2021, 1, 27, 4, 36, 14, 975000, tzinfo=tzlocal()),
            "JobRunState": "FAILED",
            "Arguments": {
                "--additional-python-modules": "awswrangler",
                "--conf": "spark.executor.memory=40g",
                "--conf ": "spark.driver.memory=40g",
                "--enable-spark-ui": "true",
                "--extra-py-files": "s3://GlueJobName/lttb.py",
                "--job-bookmark-option": "job-bookmark-disable",
                "--spark-event-logs-path": "s3://GlueJobName/glue-script/spark-event-logs"
            },
            "ErrorMessage": "MemoryError: Unable to allocate xxxxx",
            "PredecessorRuns": [],
            "AllocatedCapacity": 8,
            "ExecutionTime": 199,
            "Timeout": 2880,
            "MaxCapacity": 8.0,
            "WorkerType": "G.2X",
            "NumberOfWorkers": 4,
            "LogGroupName": "/aws-glue/jobs",
            "GlueVersion": "2.0"
        }
    ],
    "NextToken": "string"
}
So, what I am doing now is looping through the response object to check whether the "CompletedOn" date matches yesterday's date (prev_day, calculated with datetime and timedelta). I do this in a while loop to fetch the last 10000 executions, as a single 'get_job_runs' call is insufficient.
import logging
import boto3
from datetime import datetime, timedelta

logger = logging.getLogger()
logger.setLevel(logging.INFO)

glue_client = boto3.client("glue")

def filter_failed_exec_prev_day(executions, prev_day) -> list:
    filtered_resp = []
    for execution in executions['JobRuns']:
        if execution['JobRunState'] == 'FAILED' and execution['CompletedOn'].date() == prev_day:
            filtered_resp.append(execution)
    return filtered_resp

def get_final_executions() -> list:
    final_job_runs_list = []
    MAX_EXEC_SEARCH_CNT = 10000
    prev_day = (datetime.utcnow() - timedelta(days=1)).date()
    buff_exec_cnt = 0
    l_job = 'GlueJobName'
    response = glue_client.get_job_runs(
        JobName=l_job
    )
    resp_count = len(response['JobRuns'])
    if resp_count > 0:
        buff_exec_cnt += resp_count
        filtered_resp = filter_failed_exec_prev_day(response, prev_day)
        final_job_runs_list.extend(filtered_resp)
        while buff_exec_cnt <= MAX_EXEC_SEARCH_CNT:
            if 'NextToken' in response:
                response = glue_client.get_job_runs(
                    JobName=l_job
                )
                buff_exec_cnt += len(response['JobRuns'])
                filtered_resp = filter_failed_exec_prev_day(response, prev_day)
                final_job_runs_list.extend(filtered_resp)
            else:
                logger.info(f"{l_job} executions list: {final_job_runs_list}")
                break
    return final_job_runs_list
Here, I use the while loop to stop after hitting 10K executions, which is triple the number of executions we see on this job each day.
Now, I am hoping to break the while loop once I encounter an execution that belongs to prev_day - 1. Is it possible to search the response dict for prev_day - 1, to make sure all of the previous day's executions are covered, given the datetime.datetime object boto3 returns for the CompletedOn attribute?
Appreciate reading through.
Thank you
I looked at your code, and I think it might always return the same results, because you're not iterating through the result set correctly.
here:
while buff_exec_cnt <= MAX_EXEC_SEARCH_CNT:
    if 'NextToken' in response:
        response = glue_client.get_job_runs(
            JobName=l_job
        )
you need to pass the NextToken value to the get_job_runs method, like this:
response = glue_client.get_job_runs(
    JobName=l_job, NextToken=response['NextToken']
)
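Putting both points together, a sketch of a corrected loop (it reuses the glue_client from above, filters FAILED runs for prev_day, and assumes job runs come back newest-first, which is what get_job_runs appears to return, so it can stop paging once it sees anything older than prev_day):

def get_failed_runs_for_prev_day(job_name: str, prev_day) -> list:
    failed_runs = []
    next_token = None
    while True:
        kwargs = {"JobName": job_name}
        if next_token:
            kwargs["NextToken"] = next_token
        response = glue_client.get_job_runs(**kwargs)
        for run in response["JobRuns"]:
            completed = run.get("CompletedOn")
            if completed is None:
                continue  # run still in progress, no completion time yet
            if completed.date() < prev_day:
                return failed_runs  # older than prev_day: stop paging early
            if run["JobRunState"] == "FAILED" and completed.date() == prev_day:
                failed_runs.append(run)
        next_token = response.get("NextToken")
        if not next_token:
            return failed_runs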

Write dataframe to text file as JSON-encoded dictionaries with Pandas

I have a large text file containing JSON-encoded dictionaries line by line.
{"a": 10, "b": 11, "c": 12, "d": 13, "e": 14, "f": 15, "g": 16, "h": 17, "i": 18, "j": 19}
{"a": 20, "b": 21, "c": 22, "d": 23, "e": 24, "f": 25, "g": 26, "h": 27, "i": 28, "j": 29}
...
I am using Pandas because it allows me to easily rename and reindex the dictionary keys.
with open("my_dictionaries.txt") as f:
my_dicts = [json.loads(line.strip()) for line in f]
df = pd.Dataframe(my_dicts)
df.rename(columns= ...)
df.reindex(columns= ...)
Now I want to write the altered dictionaries back to a text file, line by line, like the original example. I don't want to use df.to_csv() because my data has some quirks that make a CSV more difficult to use. I have been experimenting with the df.to_dict() and df.to_json() methods but am a bit stuck.
Any suggestions?
You can use this:
import json

with open('output.txt', 'w') as f:
    for row in df.to_dict('records'):
        f.write(json.dumps(row) + '\n')
or:
import json

with open('output.txt', 'w') as f:
    f.writelines([(json.dumps(r) + '\n') for r in df.to_dict('records')])
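Alternatively, pandas can write JSON Lines directly: to_json with orient='records' and lines=True emits one JSON object per line (the output path is just illustrative):

# Write the DataFrame as JSON Lines: one dict per row, one row per line.
df.to_json('output.txt', orient='records', lines=True)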

Python Binary and Regular String Confusion

I am sure this is a fairly noob question; I have googled and cannot find a straight answer, but I may be asking the wrong thing. I am trying to make an Out Of Box Configuration script, and all the questions that need answering are stored in a file called pass.ini. When I get user input from getstr (using curses) and it populates my files, the values all come out as b'variable string'. When I try a strip command, I get b'riable strin', and str(variable) has the same issue. I read that b'<variable_string>' can be a sign that the value is bytes instead of decoded text, so I tried a decode command, and that failed with 'str' object has no attribute 'decode'. I have it writing out via ConfigParser, and to a separate file with a plain file.write. Right now everything is commented out, and I am out of ideas.
Here is the info gathering module:
wrapper(CommitChanges)
curses.echo()
stdscr.addstr( 8, 19, config.CIP , curses.color_pair(3) )
config.CIP = stdscr.getstr( 8, 19, 15)
stdscr.addstr( 9, 19, config.CSM , curses.color_pair(3) )
config.CSM = stdscr.getstr( 9, 19, 15)
stdscr.addstr( 10, 19, config.CGW , curses.color_pair(3) )
config.CGW = stdscr.getstr(10, 19, 15)
stdscr.addstr( 11, 19, config.CD1 , curses.color_pair(3) )
config.CD1 = stdscr.getstr(11, 19, 15)
stdscr.addstr( 12, 19, config.CD2 , curses.color_pair(3) )
config.CD2 = stdscr.getstr(12, 19, 15)
stdscr.addstr( 13, 19, config.CNTP, curses.color_pair(3) )
config.CNTP = stdscr.getstr(13, 19, 15)
stdscr.addstr( 16, 19, config.CHN , curses.color_pair(3) )
config.CHN = stdscr.getstr(16, 19, 15)
stdscr.addstr( 14, 19, config.CID , curses.color_pair(3) )
config.CID = stdscr.getstr(14, 19, 15)
stdscr.addstr( 15, 19, config.CS , curses.color_pair(3) )
config.CS = stdscr.getstr(15, 19, 15)
This is the file output module
def CommitChanges():
    MOP = "X"
    Config['Array=all']['PTLIP'] = a
    Config['Array=all']['PTLSM'] = config.CSM.decode('utf-8')
    Config['Array=all']['PTLGW'] = config.CGW.decode('utf-8')
    Config['Array=all']['PTLD1'] = config.CD1.decode('utf-8')
    Config['Array=all']['PTLD2'] = config.CD2.decode('utf-8')
    Config['Array=all']['PTLNTP'] = config.CNTP.decode('utf-8')
    Config['Array=all']['PTLIF'] = config.CIFN.decode('utf-8')
    Config['Array=all']['PTLHSTNM'] = config.CHN.decode('utf-8')
    Config['Array=all']['PTLMOB'] = config.CMOB.decode('utf-8')
    Config['Array=all']['customerid'] = config.CID.decode('utf-8')
    Config['Array=all']['site'] = config.CS.decode('utf-8')
    with open('/opt/passp/pass.ini', 'w') as passini:
        Config.write(passini, space_around_delimiters=False)
    tpass = open('./pass.b', 'w')
    tpass.write("[Array=All]" + "\n")
    tpass.write("ptlip=" + a + "\n")
    tpass.write("ptlsm=" + config.CSM.decode('utf-8') + "\n")
    tpass.write("ptlgw=" + config.CGW.decode('utf-8') + "\n")
    tpass.write("ptld1=" + config.CD1.decode('utf-8') + "\n")
    tpass.write("ptld2=" + config.CD2.decode('utf-8') + "\n")
    tpass.write("ptlntp=" + config.CNTP.decode('utf-8') + "\n")
    tpass.write("ptlif=" + config.CIFN.decode('utf-8') + "\n")
    tpass.write("ptldhstnm=" + config.CHN.decode('utf-8') + "\n")
    tpass.write("ptlmob=" + config.CMOB.decode('utf-8') + "\n")
    tpass.write("customerid=" + config.CID.decode('utf-8') + "\n")
    tpass.write("site=" + config.CS.decode('utf-8') + "\n")
    #if Backupfiles():
    textchanges()
    return
Here is the file save output created by ConfigParser
[Array=all]
ptlip=b'123'
ptlsm=b'321'
ptlgw=b'111'
ptld1=b'222'
ptld2=b'333'
ptlntp=b'444'
ptlif=s19
ptlhstnm=b'555'
ptlmob=
customerid=b'666'
site=b'777'
It matches perfectly when I do a direct write (they were from two different runs, but even with empty data it has the b'' wrapper).
An interesting note here: 'ptlif' is gathered by finding the interface name, not from user input, so the problem has to be in how the config.XXXX variables are stored.
[Array=All]
ptlip=b''
ptlsm=b''
ptlgw=b''
ptld1=b''
ptld2=b''
ptlntp=b''
ptlif=s19
ptldhstnm=b''
ptlmob=
customerid=b''
site=b''
Ok, so I figured out what my problem was.
curses getstr returns a bytes value, and I was assigning it to a variable I treated as an ordinary string.
I created a new set of variables and set them up like this:
cipb = b'foo'
cipb = stdscr.getstr( y, x, len )
config.CIP = cipb.decode()
That fixed it. Hopefully this might help someone else out with a similar issue.
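For illustration, a minimal standalone sketch of the same bytes-vs-str behavior (the value is hypothetical; no curses needed to reproduce it):

raw = b'192.168.1.10'       # what stdscr.getstr() hands back: bytes, not str
print(str(raw))             # prints b'192.168.1.10' -- the b'' wrapper leaks into output
print(raw.decode('utf-8'))  # prints 192.168.1.10 -- proper text for ConfigParser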

Porting Node.js script to Python

I want to port this Node.js script for controlling a Sky box over to Python: https://github.com/dalhundal/sky-remote/blob/master/sky-remote.js
I've gone through and done the best that I can; the code is below:
import time, math, socket, struct
from array import array

#sky q port 5900
class remote:
    commands = {"power": 0, "select": 1, "backup": 2, "dismiss": 2, "channelup": 6, "channeldown": 7, "interactive": 8, "sidebar": 8, "help": 9, "services": 10, "search": 10, "tvguide": 11, "home": 11, "i": 14, "text": 15, "up": 16, "down": 17, "left": 18, "right": 19, "red": 32, "green": 33, "yellow": 34, "blue": 35, 0: 48, 1: 49, 2: 50, 3: 51, 4: 52, 5: 53, 6: 54, 7: 55, 8: 56, 9: 57, "play": 64, "pause": 65, "stop": 66, "record": 67, "fastforward": 69, "rewind": 71, "boxoffice": 240, "sky": 241}
    connectTimeout = 1000

    def __init__(self, ip, port=49160):
        self.ip = ip
        self.port = port

    def showCommands(self):
        for command, value in self.commands.iteritems():
            print str(command) + " : " + str(value)

    def getCommand(self, code):
        try:
            return self.commands[code]
        except:
            print "Error: command '" + code + "' is not valid"
            return False

    def press(self, sequence):
        if isinstance(sequence, list):
            for item in sequence:
                toSend = self.getCommand(item)
                if toSend:
                    self.sendCommand(toSend)
                    time.sleep(0.5)
        else:
            toSend = self.getCommand(sequence)
            if toSend:
                self.sendCommand(toSend)

    def sendCommand(self, code):
        commandBytes = array('l', [4, 1, 0, 0, 0, 0, int(math.floor(224 + (code / 16))), code % 16])
        try:
            client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        except socket.error, msg:
            print 'Failed to create socket. Error code: ' + str(msg[0]) + ' , Error message : ' + msg[1]
            return
        try:
            client.connect((self.ip, self.port))
        except:
            print "Failed to connect to client"
            return
        l = 12
        timeout = time.time() + self.connectTimeout
        while 1:
            data = client.recv(1024)
            if len(data) < 24:
                client.sendall(data[0:l])
                l = 1
            else:
                client.sendall(buffer(commandBytes))
                commandBytes[1] = 0
                client.sendall(buffer(commandBytes))
                client.close()
                break
            if time.time() > timeout:
                print "timeout error"
                break
I think the issue is how I form the buffers, but I'm not entirely sure, as this is the first time I've dealt with buffers.
Having read through the Node.js documentation on new Buffer, it looks like it creates an array of octets, whereas what I have is an array of ints. I may be wrong, but an octet is 8 bits whilst an int is typically 4 bytes. I've tried changing the array type to long and double, but this doesn't seem to resolve the issue.
Having had time for a quick read on how Node.js handles buffers, and buffers in general, it looks like what I thought was right.
Change:
commandBytes = array('l', [4, 1, 0, 0, 0, 0, int(math.floor(224 + (code / 16))), code % 16])
to:
commandBytes = bytearray([4, 1, 0, 0, 0, 0, int(math.floor(224 + (code / 16))), code % 16])
Then just pass the bytearray directly:
client.sendall(commandBytes)
instead of:
client.sendall(buffer(commandBytes))
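For reference, a quick Python 3 sketch of the same byte packing (command code 16, "up" from the table above, is just an example value):

code = 16  # "up"
commandBytes = bytearray([4, 1, 0, 0, 0, 0, 224 + (code // 16), code % 16])
print(list(commandBytes))  # [4, 1, 0, 0, 0, 0, 225, 0]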
