Write dataframe to text file as JSON-encoded dictionaries with Pandas - python

I have a large text file containing JSON-encoded dictionaries line by line.
{"a": 10, "b": 11, "c": 12, "d": 13, "e": 14, "f": 15, "g": 16, "h": 17, "i": 18, "j": 19}
{"a": 20, "b": 21, "c": 22, "d": 23, "e": 24, "f": 25, "g": 26, "h": 27, "i": 28, "j": 29}
...
I am using Pandas because it allows me to easily rename and reindex the dictionary keys.
with open("my_dictionaries.txt") as f:
my_dicts = [json.loads(line.strip()) for line in f]
df = pd.Dataframe(my_dicts)
df.rename(columns= ...)
df.reindex(columns= ...)
Now I want to write the altered dictionaries back to a text file, line by line, like the original example. I don't want to use DataFrame.to_csv() because my data has some quirks that make a CSV more difficult to use. I have been experimenting with the DataFrame.to_dict() and DataFrame.to_json() methods but am a bit stuck.
Any suggestions?

You can use this:
import json

with open('output.txt', 'w') as f:
    for row in df.to_dict('records'):
        f.write(json.dumps(row) + '\n')
or:
import json

with open('output.txt', 'w') as f:
    f.writelines([json.dumps(r) + '\n' for r in df.to_dict('records')])
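Alternatively, pandas can write line-delimited JSON itself; a minimal sketch (note that to_json uses compact separators, so its output has no spaces after the colons, unlike json.dumps above):
df.to_json('output.txt', orient='records', lines=True)  # one JSON object per line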

Related

Pandas: reading indented JSON created by to_json

I'm writing JSON to a file using DataFrame.to_json() with the indent option:
df.to_json(path_or_buf=file_json, orient="records", lines=True, indent=2)
The important part here is indent=2; without it, everything works.
Then how do I read this file back using pandas.read_json()?
I'm trying the code below, but it expects the file to be a JSON object per line, so the indentation messes things up:
df = pd.read_json(file_json, lines=True)
I didn't find any options in read_json to make it handle the indentation.
How else could I read this file created by to_json, possibly avoiding writing my own reader?
The combination of lines=True, orient='records', and indent=2 doesn't actually produce valid JSON.
lines=True is meant to create line-delimited JSON, but indent=2 adds extra line breaks. You can't have your delimiter be line breaks AND have extra line breaks!
If you use just orient='records' and indent=2, then it does produce valid JSON.
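A minimal sketch of the two combinations that do round-trip (the file names are placeholders):
# line-delimited JSON: one record per line, no indent
df.to_json('records.jsonl', orient='records', lines=True)
df = pd.read_json('records.jsonl', lines=True)
# a single indented JSON array: no lines=True
df.to_json('records.json', orient='records', indent=2)
df = pd.read_json('records.json', orient='records')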
The current read_json(lines=True) code can be found here:
def _combine_lines(self, lines) -> str:
    """
    Combines a list of JSON objects into one JSON object.
    """
    return (
        f'[{",".join([line for line in (line.strip() for line in lines) if line])}]'
    )
You can see that it expects to read the file line by line, which isn't possible when indent has been used.
The other answer is good, but it turns out it requires reading the entire file into memory. I ended up writing a simple lazy parser, which I include below. It requires removing the lines=True argument in df.to_json.
The usage is as follows:
for obj, pos, length in lazy_read_json('file.json'):
    print(obj['field'])  # access the json object
It also yields pos - the start position of the object in the file - and length - the length of the object in the file; this enables some more functionality for me, like being able to index objects and load them into memory on demand.
The parser is below:
import json

def lazy_read_json(filename: str):
    """
    :return generator yielding (json_obj, pos, length)

    >>> test_objs = [{'a': 11, 'b': 22, 'c': {'abc': 'z', 'zzz': {}}}, \
        {'a': 31, 'b': 42, 'c': [{'abc': 'z', 'zzz': {}}]}, \
        {'a': 55, 'b': 66, 'c': [{'abc': 'z'}, {'z': 3}, {'y': 3}]}, \
        {'a': 71, 'b': 62, 'c': 63}]
    >>> json_str = json.dumps(test_objs, indent=4, sort_keys=True)
    >>> _create_file("/tmp/test.json", [json_str])
    >>> g = lazy_read_json("/tmp/test.json")
    >>> next(g)
    ({'a': 11, 'b': 22, 'c': {'abc': 'z', 'zzz': {}}}, 120, 116)
    >>> next(g)
    ({'a': 31, 'b': 42, 'c': [{'abc': 'z', 'zzz': {}}]}, 274, 152)
    >>> next(g)
    ({'a': 55, 'b': 66, 'c': [{'abc': 'z'}, {'z': 3}, {'y': 3}]}, 505, 229)
    >>> next(g)
    ({'a': 71, 'b': 62, 'c': 63}, 567, 62)
    >>> next(g)
    Traceback (most recent call last):
    ...
    StopIteration
    """
    with open(filename) as fh:
        state = 0
        json_str = ''
        cb_depth = 0  # curly brace depth
        line = fh.readline()
        while line:
            if line[-1] == "\n":
                line = line[:-1]
            line_strip = line.strip()
            if state == 0 and line == '[':
                state = 1
                pos = fh.tell()
            elif state == 1 and line_strip == '{':
                state = 2
                json_str += line + "\n"
            elif state == 2:
                if len(line_strip) > 0 and line_strip[-1] == '{':  # count nested objects
                    cb_depth += 1
                json_str += line + "\n"
                if cb_depth == 0 and (line_strip == '},' or line_strip == '}'):
                    # end of parsing an object
                    if json_str[-2:] == ",\n":
                        json_str = json_str[:-2]  # remove trailing comma
                    state = 1
                    obj = json.loads(json_str)
                    yield obj, pos, len(json_str)
                    pos = fh.tell()
                    json_str = ""
                elif line_strip == '}' or line_strip == '},':
                    cb_depth -= 1
            line = fh.readline()

# this function is for doctest
def _create_file(filename, lines):
    # because doctest can't input newline characters :(
    f = open(filename, "w")
    for line in lines:
        f.write(line)
        f.write("\n")
    f.close()
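To run the embedded doctests (the module file name lazy_json.py is a placeholder):
python -m doctest lazy_json.py -v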

how to get the same output by destructuring in python

I want to get the same output from the second piece of code as from the first, with a slight variation: I want to destructure the list "players" and then get the same output.
Here is the first code:
import random

lottery_numbers = set(random.sample(range(22), 6))
players = [
    {"name": "Rolf", "numbers": {1, 3, 5, 7, 9, 11}},
    {"name": "Charlie", "numbers": {2, 7, 9, 22, 10, 5}},
    {"name": "Anna", "numbers": {13, 14, 15, 16, 17, 18}},
    {"name": "Jen", "numbers": {19, 20, 12, 7, 3, 5}}
]
top_player = players[0]
for player in players:
    matched_numbers = len(player["numbers"].intersection(lottery_numbers))
    if matched_numbers > len(
            top_player["numbers"].intersection(lottery_numbers)):
        top_player = player
print(top_player)
I want to destructure the list "players" into "name" and "player", and then compare the variable "player" with the numbers matched against lottery_numbers.
Here is the second piece of code:
for name, player in players[1].items():
    matched_numbers = len(player.intersection(lottery_numbers))
    if matched_numbers > len(
            top_player["numbers"].intersection(lottery_numbers)):
        top_player = player
print(top_player)
PyCharm hits me with an error like this:
    matched_numbers = len(player.intersection(lottery_numbers))
AttributeError: 'str' object has no attribute 'intersection'
PS: I am pretty new to Python and don't know much about what the error even means...
The first thing you are doing is players[1].items(). players[1] is the dict at index 1 of the players list:
{"name": "Charlie", "numbers": {2, 7, 9, 22, 10, 5}}
Calling .items() on that dict iterates over its key/value pairs, so player gets bound to the string "Charlie" on the first iteration, which is exactly why you get the AttributeError. If you want to go through all the players, you can just do something like this:
for player in players:
This will go through all the players, and each player's lottery numbers are just player["numbers"].
So your final code would look like:
for player in players:
    matched_numbers = len(player["numbers"].intersection(lottery_numbers))
    if matched_numbers > len(
            top_player["numbers"].intersection(lottery_numbers)):
        top_player = player
print(top_player)
If you want to split this into 2 different variables, you can use this:
names = []
numbers = []
for player in players:
    names.append(player["name"])
    numbers.append(player["numbers"])
finalList = zip(names, numbers)
then do:
for name, player in finalList:
A few changes to make, to do with how the players list is organised, i.e. it's not a dict but rather a list of dicts.
import random

lottery_numbers = set(random.sample(range(22), 6))
players = [
    {"name": "Rolf", "numbers": {1, 3, 5, 7, 9, 11}},
    {"name": "Charlie", "numbers": {2, 7, 9, 22, 10, 5}},
    {"name": "Anna", "numbers": {13, 14, 15, 16, 17, 18}},
    {"name": "Jen", "numbers": {19, 20, 12, 7, 3, 5}}
]
top_player = players[0]
for player in players:
    matched_numbers = len(player["numbers"].intersection(lottery_numbers))
    if matched_numbers > len(
            top_player["numbers"].intersection(lottery_numbers)):
        top_player = player
print(top_player)

for player in players:
    name = player["name"]
    numbers = player["numbers"]
    matched_numbers = len(numbers.intersection(lottery_numbers))
    if matched_numbers > len(
            top_player["numbers"].intersection(lottery_numbers)):
        top_player = player
print(top_player)
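As an aside, the same search can be written with the built-in max() and a key function; a minimal sketch:
top_player = max(players, key=lambda p: len(p["numbers"] & lottery_numbers))  # & is set intersection
print(top_player)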

Why are writes to a python serial port output buffer captured by reads from the input buffer

I am trying to write commands to a python serial port and capture the response from the connected device on the input channel of the port buffer.
I need to capture the reads as and when they occur, so I create a thread which reads whenever the input buffer is non-empty.
When I run the program, I see the data I output to the output buffer appear on the input buffer of the serial port.
The code I created is shown below:
import threading
import serial
import json
import time

def handle_data(data):
    print(data)

def read_from_port(serPort):
    while True:
        if serPort.in_waiting > 0:
            # print("in:", serPort.in_waiting)
            reading = serPort.read(serPort.in_waiting).decode('ascii')
            handle_data(reading)
        time.sleep(0.1)

ser = serial.Serial(
    port='COM15',
    baudrate=115200,
    timeout=0
)
thread = threading.Thread(target=read_from_port, args=(ser,))
thread.start()
print(ser.name)
jsonDict = {
    'c': 120,
    'i': 0,
    'p': '',
}
i = 1
while i < 1000:
    jsonDict["i"] = i
    if i % 2 == 0:
        cmd = (2 << 8) | 0
        payload = "local"
    else:
        cmd = (1 << 8) | 100
        payload = "remote"
    jsonDict["c"] = cmd
    jsonDict["p"] = payload
    output = json.dumps(jsonDict) + '\r'
    ser.write(output.encode('ascii'))  # pyserial on Python 3 expects bytes
    i = i + 1
    time.sleep(1)
ser.close()
To test the python functionality, I'm currently routing data sent to the connected device with payload "local" back to the pc.
However, the read thread is capturing both the data that is sent and the data which is returned; see below:
COM15
{"i": 1, "p": "remote", "c": 356}
{"i": 2, "p": "local", "c": 512}
{"i": 2, "p": "local", "c": 512}
{"i": 3, "p": "remote", "c": 356}
{"i": 4, "p": "local", "c": 512}
{"i": 4, "p": "local", "c": 512}
{"i": 5, "p": "remote", "c": 356}
{"i": 6, "p": "local", "c": 512}
{"i": 6, "p": "local", "c": 512}
{"i": 7, "p": "remote", "c": 356}
{"i": 8, "p": "local", "c": 512}
{"i": 8, "p": "local", "c": 512}
{"i": 9, "p": "remote", "c": 356}
Any thoughts?
The problem is not in your code. You could probably open a PuTTY or Tera Term terminal and see the exact same behavior: some devices have an option to echo back the sent commands (this is helpful when working from a terminal).
You didn't say which device you are using, but this can probably be configured - search for an 'echo' setting on your device.
A more exotic problem might be the Tx line shorted to the Rx line somewhere along the connection wire, but I'd check that only after eliminating the echo configuration.
If the echo can't be configured away, you'll just have to deal with it in code, e.g. as in the sketch below.
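A minimal sketch of one way to suppress the echo on the reading side, assuming the device echoes each command back verbatim (the send() helper and the line-based framing are assumptions, not part of the original code):

import collections

recently_sent = collections.deque(maxlen=100)  # lines this side has written

def send(ser, line):
    recently_sent.append(line)
    ser.write((line + '\r').encode('ascii'))

def handle_data(data):
    for line in data.splitlines():
        if line in recently_sent:        # looks like an echo of our own command
            recently_sent.remove(line)   # consume it once and ignore it
        else:
            print(line)                  # a genuine response from the device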

Python: how to convert single quotes to double quotes to format as a JSON string

I have a file where on each line I have text like this (representing cast of a film):
[{'cast_id': 23, 'character': "Roger 'Verbal' Kint", 'credit_id': '52fe4260c3a36847f8019af7', 'gender': 2, 'id': 1979, 'name': 'Kevin Spacey', 'order': 5, 'profile_path': '/x7wF050iuCASefLLG75s2uDPFUu.jpg'}, {'cast_id': 27, 'character': 'Edie's Finneran', 'credit_id': '52fe4260c3a36847f8019b07', 'gender': 1, 'id': 2179, 'name': 'Suzy Amis', 'order': 6, 'profile_path': '/b1pjkncyLuBtMUmqD1MztD2SG80.jpg'}]
I need to convert it into a valid JSON string, converting only the necessary single quotes to double quotes (e.g. the single quotes around the word Verbal must not be converted, and any apostrophes inside the text should not be converted either).
I am using python 3.x. I need to find a regular expression which will convert only the right single quotes to double quotes, so that the whole text results in a valid JSON string. Any ideas?
First of all, the line you gave as an example is not parsable! … 'Edie's Finneran' … contains a syntax error, no matter what.
Assuming that you have control over the input, you could simply use eval() to read in the file. (Although, in that case one would wonder why you can't produce valid JSON in the first place…)
>>> f = open('list.txt', 'r')
>>> s = f.read().strip()
>>> l = eval(s)
>>> import pprint
>>> pprint.pprint(l)
[{'cast_id': 23,
  'character': "Roger 'Verbal' Kint",
  ...
  'profile_path': '/b1pjkncyLuBtMUmqD1MztD2SG80.jpg'}]
>>> import json
>>> json.dumps(l)
'[{"cast_id": 23, "character": "Roger \'Verbal\' Kint", "credit_id": "52fe4260c3a36847f8019af7", "gender": 2, "id": 1979, "name": "Kevin Spacey", "order": 5, "profile_path": "/x7wF050iuCASefLLG75s2uDPFUu.jpg"}, {"cast_id": 27, "character": "Edie\'s Finneran", "credit_id": "52fe4260c3a36847f8019b07", "gender": 1, "id": 2179, "name": "Suzy Amis", "order": 6, "profile_path": "/b1pjkncyLuBtMUmqD1MztD2SG80.jpg"}]'
If you don't have control over the input, this is very dangerous, as it opens you up to code injection attacks.
I cannot emphasize enough that the best solution would be to produce valid JSON in the first place.
If you do not have control over the JSON data, do not eval() it!
I created a simple JSON correction mechanism, as that is more secure:
def correctSingleQuoteJSON(s):
    rstr = ""
    escaped = False
    for c in s:
        if c == "'" and not escaped:
            c = '"'  # replace single with double quote
        elif c == "'" and escaped:
            rstr = rstr[:-1]  # remove escape character before single quotes
        elif c == '"':
            c = '\\' + c  # escape existing double quotes
        escaped = (c == "\\")  # check for an escape character
        rstr += c  # append the correct json
    return rstr
You can use the function in the following way:
import json
singleQuoteJson = "[{'cast_id': 23, 'character': 'Roger \\'Verbal\\' Kint', 'credit_id': '52fe4260c3a36847f8019af7', 'gender': 2, 'id': 1979, 'name': 'Kevin Spacey', 'order': 5, 'profile_path': '/x7wF050iuCASefLLG75s2uDPFUu.jpg'}, {'cast_id': 27, 'character': 'Edie\\'s Finneran', 'credit_id': '52fe4260c3a36847f8019b07', 'gender': 1, 'id': 2179, 'name': 'Suzy Amis', 'order': 6, 'profile_path': '/b1pjkncyLuBtMUmqD1MztD2SG80.jpg'}]"
correctJson = correctSingleQuoteJSON(singleQuoteJson)
print(json.loads(correctJson))
Here is the code to get the desired output:
import ast

def getJson(filepath):
    fr = open(filepath, 'r')
    lines = []
    for line in fr.readlines():
        line_split = line.split(",")
        set_line_split = []
        for i in line_split:
            i_split = i.split(":")
            i_set_split = []
            for split_i in i_split:
                set_split_i = ""
                rev = ""
                i = 0
                for ch in split_i:
                    if ch in ['\"', '\'']:
                        set_split_i += ch
                        i += 1
                        break
                    else:
                        set_split_i += ch
                        i += 1
                i_rev = (split_i[i:])[::-1]
                state = False
                for ch in i_rev:
                    if ch in ['\"', '\''] and state == False:
                        rev += ch
                        state = True
                    elif ch in ['\"', '\''] and state == True:
                        rev += ch + "\\"
                    else:
                        rev += ch
                i_rev = rev[::-1]
                set_split_i += i_rev
                i_set_split.append(set_split_i)
            set_line_split.append(":".join(i_set_split))
        line_modified = ",".join(set_line_split)
        lines.append(ast.literal_eval(str(line_modified)))
    return lines

lines = getJson('test.txt')
for i in lines:
    print(i)
Apart from eval() (mentioned in user3850's answer), you can use ast.literal_eval, as in the sketch below.
This has been discussed in the thread: Using python's eval() vs. ast.literal_eval()?
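A minimal sketch of that approach, assuming each line of the file is a valid Python literal (the Edie's line above is not, so lines like it would still need manual repair first):

import ast
import json

with open('list.txt') as f:
    for line in f:
        record = ast.literal_eval(line.strip())  # parses literals only; no code execution, unlike eval()
        print(json.dumps(record))                # re-serialize as valid JSON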
You can also look at the following discussion threads from Kaggle competition which has data similar to the one mentioned by OP:
https://www.kaggle.com/c/tmdb-box-office-prediction/discussion/89313#latest-517927
https://www.kaggle.com/c/tmdb-box-office-prediction/discussion/80045#latest-518338

python json database corrupted

I hit a bug in a huge database with thousands of lines, so I tried removing thousands of them (I have a backup file, obviously) to isolate the problem, and I ended up with another problem. I'll provide a line from the database that works without problems, in case you want to compare:
#Working
["#saelyth", 8, 40, 4, "000", "000", "0", 11, "!Bot, me lees?", "legionanimenet", 0, "primermensajitodeesapersona"]
#Not working
["!anon7002", 545, 3166, 7, "000", "000", "0", 13, "\u2014\u00a1 Hijo! Estas calificaciones merecen una golpiza. \u2014\u00bf Verdad que si mam ...Vamos que yo se donde vive la maestra.", "legionanimenet", 0, "primermensajitodeesapersona"]
Causing this error:
ValueError: Extra data: line 1 column 240 - line 2 column 1 (char 239 - 366)
My question is: what is wrong there? I have no idea, and all my efforts to find out what makes json raise that error have been unsuccessful.
So I completely deleted that line and tried to load the full database without it, but now I get a new error:
ValueError: Expecting ',' delimiter: line 1 column 62 (char 61)
With so many, many records like:
["tyjyu", 59, 302, 19, "000", "000", "0", 13, "holas", "legionanimenet", 0, "primermensajitodeesapersona"]
["inuyacha64", 15944, 79401, 3496, "000", "F00", "0", 16, "cuidence chau", "legionanimenet", 2, "primermensajitodeesapersona"]
["!anon3573", 24, 140, "1", "nada", "nada", "nada", "nada", "nada", "legionanimenet", 0, "primermensajitodeesapersona"]
["eldiaoscuro", 74, 446, 16, "603", "369", "4", 13, "nada", "legionanimenet", 0, "primermensajitodeesapersona"]
What would be an efficient way to FIND the missing , that causes this error? And if possible, I'd also like to know whether json has a maximum number of items it can load, or anything like that.
EDIT
The code to load the info is:
data = []
with open('listas\Estadisticas.txt', 'r+') as f:
    for line in f:
        data_line = json.loads(line)
        if data_line[0] == user.name and data_line[9] == "legionanimenet":
            data_line[1] = int(data_line[1]) + int(palabrasdelafrase)
            data_line[2] = int(data_line[2]) + int(letrasdelafrase)
            data_line[3] = int(data_line[3]) + 1
            data_line[4] = user.nameColor
            data_line[5] = user.fontColor
            data_line[6] = user.fontFace
            data_line[7] = user.fontSize
            data_line[11] = data_line[8]
            data_line[8] = message.body
            data_line[9] = "legionanimenet"
        data.append(data_line)
    f.seek(0)
    f.writelines(["%s\n" % json.dumps(i) for i in data])
    f.truncate()
I hope someone can help me with this.
EDIT2: Python version is 3.3.2 (IDLE).
Putting print(repr(line)) just before the json.loads call will print each line UNTIL it finds the error. I thank @nneonneo for that.
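A minimal sketch of that debugging approach as a standalone script (the path is from the question; everything else is an assumption, and the % formatting keeps it compatible with Python 3.3):

import json

with open('listas\\Estadisticas.txt') as f:
    for lineno, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except ValueError as e:
            # first malformed record: report where it is and what it looks like
            print("line %d: %s" % (lineno, e))
            print(repr(line))
            break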
