I am trying to parse a "complicated" JSON string that is returned to me by an API.
It looks like this:
{
  "data": [
    ["Distance to last strike", "23.0", "miles"],
    ["Time of last strike", "1/14/2022 9:23:42 AM", ""],
    ["Number of strikes today", "1", ""]
  ]
}
While the end goal is to extract the distance, date/time, and count, for now I am just trying to successfully get the distance.
My Python script is:
import requests
import json
response_API = requests.get('http://localhost:8998/api/extra/lightning.json')
data = response_API.text
parse_json = json.loads(data)
value = parse_json['Distance to last strike']
print(value)
This does not work. If I change the value line to
value = parse_json['data']
then the entire string I listed above is returned.
I am hoping it's just a simple formatting issue. Suggestions?
You have an object with a list of lists. If you fetch
value = parse_json['data']
then you will have a list containing three lists. So:
print(value[0][1])
will print "23.0".
Hi. So basically I have 2 arrays. For the sake of simplicity, the following:
array_notepad = []
array_images = []
Some magic happens and they are populated, i.e. data is loaded: array_notepad is read from a notepad file, whilst array_images is populated with the RGB values from a folder containing images.
How do I use array_notepad as a label of array_images?
i.e. the label of array_images[0] is array_notepad[0], array_images[1] is array_notepad[1], array_images[2] is array_notepad[2], and so on until array_images[999] is array_notepad[999].
If it makes any difference I am using glob and cv2 to read the image data, whilst normal python file reader to read the content in the notepad.
Thanks a lot for your help!
Your question isn't entirely clear on what your expected output should be. You mention 'label' - to me it sounds like you're describing key-value pairs i.e. a dictionary.
In which case you should be able to use the zip function as described in this question: Convert two lists into a dictionary
It sounds like you want to create a dictionary from the 2 lists. If so, you could do as follows.
array_notepad = ['label1', 'label2', 'label3']
array_images = ['rgb1', 'rgb2', 'rgb3']
d = { label: value for label, value in zip(array_notepad, array_images) }
d
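Applied to the image/label use case, a minimal sketch (the array contents here are made up for illustration):

```python
array_notepad = ["cat", "dog", "bird"]  # hypothetical labels read from the text file
array_images = [[255, 0, 0], [0, 255, 0], [0, 0, 255]]  # hypothetical per-image RGB data

# Pair each image with its label by position:
labeled = list(zip(array_images, array_notepad))
print(labeled[0])  # ([255, 0, 0], 'cat')

# If the labels are unique, dict(zip(...)) is shorthand for the comprehension above:
by_label = dict(zip(array_notepad, array_images))
print(by_label["dog"])  # [0, 255, 0]
```

Note that `zip` stops at the shorter list, so mismatched lengths (e.g. a missing label) will silently drop images rather than raise an error.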
I have a series of files, each with a JSON, that I'd like to read into Pandas. This is pretty straightforward:
import json
from pandas import json_normalize

data_unfiltered = [json.load(open(jf)) for jf in json_input_files]
# next call used to be df = pandas.DataFrame(data_unfiltered)
# instead, json_normalize flattens nested dicts
df = json_normalize(data_unfiltered)
Now, a new wrinkle. Some of these input files no longer contain just a plain JSON object but instead a JSON array of objects: [ { JSON }, { JSON }... ].
json.load is pretty great because it reads a whole file and parses it straight into Python objects; I don't have to parse the file at all. How would I now turn a list of files, some of which contain a single JSON object and some of which contain a list of JSON objects, into a flat list of parsed JSON objects?
Bonus question: I used to be able to add the filename into each JSON with
df['filename'] = pandas.Series(json_input_files).values
but now I can't do that any more since now one input file might correspond to many JSONs in the output list. How can I add the filenames to the JSONs before I put them into a pandas dataframe?
Edit: Work in progress, per request in comments:
data_unfiltered = []
for jf_file in json_input_files:
    jf = open(jf_file)
    if isinstance(jf, list):  # this is obviously wrong
        for j in jf:
            d = json.load(j)  # this is what I need to fix
            d['details'] = jf_file
            data_unfiltered.append(d)
    else:  # not a list, assume dict
        d = json.load(jf)
        d['details'] = jf_file
        data_unfiltered.append(d)
but json.load() worked perfectly for what I wanted (file object to JSON) and has no equiv for arrays. I figure I have to manually parse the file into a list of blobs and then do json.loads() on each blob? That's pretty kludgey though.
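One way to make that loop work without any manual parsing (a sketch, with `load_json_records` as a hypothetical helper name): call json.load once per file, then check isinstance on the parsed value rather than the file object, wrapping single objects in a one-element list so both cases go through the same path:

```python
import json

def load_json_records(json_input_files):
    """Parse each file once; wrap single objects so everything becomes a list of dicts."""
    records = []
    for path in json_input_files:
        with open(path) as fh:
            parsed = json.load(fh)  # may be a dict or a list, depending on the file
        if not isinstance(parsed, list):  # test the parsed value, not the file object
            parsed = [parsed]
        for d in parsed:
            d['details'] = path  # keep the source filename on every record
            records.append(d)
    return records
```

This also answers the bonus question: the filename is attached per record before the list goes into `json_normalize`, so a file containing many objects contributes many rows, each tagged with its source.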
High-level description of what I want: I want to receive a JSON response detailing certain values of fields/features, say {a: 1, b: 2, c: 3}, as a Flask (JSON) request. Then I want to convert the resulting Python dict into an R dataframe with rpy2 (a single row), and feed it into a model in R that expects input where each column is a factor. I usually use Python for this sort of thing and serialize a vectorizer object from sklearn, but this particular analysis needs to be done in R.
So here is what I'm doing so far.
import os
import rpy2.robjects as robjects
from rpy2.robjects.packages import STAP

model = os.path.join('model', 'rsource_file.R')
with open(model, 'r') as f:
    string = f.read()
model = STAP(string, "model")
data_r = robjects.DataFrame(data)
data_factored = model.prepdata(data_r)
result = model.predict(data_factored)
The relevant R functions from rsource_file.R are:
prepdata = function(row){
  for(v in vars) if(typeof(row[,v])=="character") row[,v] = as.factor(row[,v], levs[0,v])
  modm2 = model.matrix(frm, data=tdz2, contrasts.arg = c1, xlev = levs)
}
where the contrasts and levels have been pre-extracted from an existing dataset like so:
#vars = vector of columns of interest
load(data.Rd)
for(v in vars) if(typeof(data[,v])=="character") data[,v] = as.factor(data[,v])
frm = ~ weightedsum_of_things #function mapped, causes no issue
modm= (model.matrix(frm,data=data))
levs = lapply(data, levels)
c1 = attributes(modm)$contrasts
Calling prepdata does not give me what I want, which is for the new dataframe (built from the JSON request as data_r) to be properly turned into a vector of "factors" with the same encoding by which the elements of the data.Rd dataset were transformed.
Thank you for your assistance, will upvote.
More detail: So what my code is attempting to do is map the labels() method over the dataset to extract a list of lists of possible "levels" for a factor -- and then, for matching values in the new input, call factor() with the new data row as well as the corresponding set of levels, levs[0,v].
This throws an error that you can't use factor if there isn't more than one level. I think this might have something to do with the labels/levels difference? I'm calling levs[,v] to get the element of the return value of lapply(data, levels) corresponding to the "title" v (a string). I extracted the levels from the dataset -- but referencing them in the body of prepdata this way doesn't seem to work. Do I need to extract labels instead? If so, how can I do that?
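As a language-agnostic illustration of the fixed-levels idea (a plain-Python sketch with made-up column names, not rpy2 or the asker's R code): encode each categorical value against a level list extracted from the training data, so new rows get exactly the same integer codes the model was trained with, and unseen levels fail loudly instead of silently shifting the encoding.

```python
# Levels pre-extracted from the training data (analogous to levs = lapply(data, levels) in R)
levels = {"color": ["blue", "green", "red"], "size": ["L", "M", "S"]}

def encode_row(row, levels):
    """Map each categorical value to its position in the training-time level list."""
    encoded = {}
    for col, value in row.items():
        if col in levels:
            if value not in levels[col]:
                raise ValueError(f"unseen level {value!r} for column {col!r}")
            encoded[col] = levels[col].index(value)
        else:
            encoded[col] = value  # non-categorical columns pass through unchanged
    return encoded

print(encode_row({"color": "red", "size": "M", "weight": 1.5}, levels))
# {'color': 2, 'size': 1, 'weight': 1.5}
```

The R equivalent of the "unseen level" guard is passing the stored levels via factor(x, levels = ...) / the xlev argument, rather than re-deriving levels from the single new row (which is what makes factor complain about having only one level).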
I have a file that has columns that look like this:
Column1,Column2,Column3,Column4,Column5,Column6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
Column1,Column3,Column2,Column6,Column5,Column4
1,3,2,6,5,4
1,3,2,6,5,4
1,3,2,6,5,4
Column2,Column3,Column4,Column5,Column6,Column1
2,3,4,5,6,1
2,3,4,5,6,1
2,3,4,5,6,1
The columns randomly re-order in the middle of the file, and the only way to know the order is to look at the last set of headers right before the data (Column1,Column2, etc.). (I've also simplified the data so that it's easier to picture. In real life, there is no way to tell the data apart, as they are all large integer values that could really go into any column.)
Obviously this isn't very SQL Server friendly when it comes to using BULK INSERT, so I need to find a way to arrange all of the columns in a consistent order that matches my table's column order in my SQL database. What's the best way to do this? I've heard Python is the language to use, but I have never worked with it. Any suggestions/sample scripts in any language are appreciated.
A solution in Python:
I would read line by line and look for headers. When I find a header, I use it to figure out the order (somehow). Then I pass that order to itemgetter, which does the magic of reordering elements:
from operator import itemgetter

def header_parse(line, order_dict):
    header_info = line.split(',')
    indices = [None] * len(header_info)
    for i, col_name in enumerate(header_info):
        indices[order_dict[col_name]] = i
    return indices

def fix(fname, foutname):
    with open(fname) as f, open(foutname, 'w') as fout:
        # Assume the first line is a "header" and gives the order to use
        # for the rest of the file
        line = f.readline()
        order_dict = dict((name, i) for i, name in enumerate(line.strip().split(',')))
        reorder_magic = itemgetter(*header_parse(line.strip(), order_dict))
        for line in f:
            if line.startswith('Column'):  # somehow determine if this is a "header"
                reorder_magic = itemgetter(*header_parse(line.strip(), order_dict))
            else:
                fout.write(','.join(reorder_magic(line.strip().split(','))) + '\n')

if __name__ == '__main__':
    import sys
    fix(sys.argv[1], sys.argv[2])
Now you can call it as:
python fixscript.py badfile goodfile
Since you didn't mention a specific problem, I'm going to assume you're having problems coming up with an algorithm.
For each row,
    Parse the row into fields.
    If it's the first header line,
        Output the header.
        Create a map of field names to positions.
            %map = map { $fields[$_] => $_ } 0..$#fields;
        Create a map of original positions to new positions.
            @map = @map{ @fields };
    If it's a header line other than the first,
        Update the map of original positions to new positions.
            @map = @map{ @fields };
    If it's not a header line,
        Reorder the fields.
            @fields[ @map ] = @fields;
        Output the row.
(Snippets are in Perl.)
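For comparison, the same algorithm sketched in Python (reorder_rows is a hypothetical name; header detection relies on the sample data's "Column" prefix, and the first line is assumed to be a header):

```python
def reorder_rows(lines):
    """Return rows reordered to match the first header's column order."""
    out = []
    first_header = None
    positions = None
    for line in lines:
        fields = line.strip().split(',')
        if fields[0].startswith('Column'):  # header detection, as in the sample data
            if first_header is None:
                first_header = fields
                out.append(','.join(fields))  # output the first header only
            # For each output column name, find its position in the current header
            name_to_pos = {name: i for i, name in enumerate(fields)}
            positions = [name_to_pos[name] for name in first_header]
        else:
            out.append(','.join(fields[p] for p in positions))
    return out
```

Later headers are dropped from the output (handy before a BULK INSERT), and every data row comes out in the first header's order.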
This can be fixed easily in two steps:
detect each new header line as it appears in the file
re-map and re-output the rows that follow it in a consistent (alphabetically sorted) column order
Here is an example of how you can go about it:
def is_header(line):
    return line.find('Column') >= 0

def process(lines):
    headers_map = None
    for line in lines:
        line = line.strip()
        if is_header(line):
            headers = list(enumerate(line.split(",")))
            headers_map = dict(headers)
            headers.sort(key=lambda iv: headers_map[iv[0]])
            print(",".join(h for i, h in headers))
            continue
        values = list(enumerate(line.split(",")))
        values.sort(key=lambda iv: headers_map[iv[0]])
        print(",".join(v for i, v in values))

if __name__ == "__main__":
    import sys
    process(open(sys.argv[1]))
You can also change the is_header function to correctly identify headers in real-world cases.